DECLARATION IN RESPONSE TO THE REQUIREMENT OF 37 C.F.R. § 1.105

This declaration is prepared in response to the 37 C.F.R. § 1.105 requirement in the 
Office Action mailed on October 21, 2004. 

I, the Declarant, am one of the named inventors on U.S. Patent Application No. 
10/055,178, in which an office action was mailed on October 21, 2004. The Office Action
includes a 105 requirement relating to (i) PC-NAS and (ii) products and services of Language
Analysis Systems, Inc. ("LAS").

The present application, as filed, included a paragraph asserting to describe PC-NAS,
stating in part that "The assignee has developed a software program known as PC-NAS. An
early version of this program was incorporated into a government computer system more than
one year before the priority date of this application." Specification at page 5, lines 11-13. This
statement is inaccurate for at least the reason that PC-NAS was not incorporated into a
government computer system more than one year before the priority date of this application.
Because the statement was inaccurate, I, through my patent attorneys, removed the PC-NAS
paragraph from the specification in an amendment filed on April 12, 2004.

An investigation into LAS's products and services was performed that involved me, at
least one other individual at LAS, and my patent attorneys. The purpose of the investigation was
to determine which, if any, of LAS's products and services were prior art to the present
application or should otherwise be disclosed to the U.S. Patent & Trademark Office ("PTO"). In
the course of the investigation, we determined that PC-NAS had never been disclosed outside of
During the investigation, however, we did determine that four products/services of LAS
should be disclosed to the PTO. These four are: (i) Arabic Name Classifier, (ii) Arabic Name
Analyzer, (iii) Consular Lookout And Support System, and (iv) Distributed Name Check. These
four products/services are each described and disclosed in another Declaration by me that was
filed in the present case on July 13, 2004, as part of an Information Disclosure Statement.

I am aware that the website www.archive.org (the "archive website") has a number of 
documents purporting to be archives of the LAS website on various dates. I do not know 
whether or not these documents are accurate. The archive website includes four documents, and
only four, that are dated prior to March 25, 1998, which is the priority date of the present
application. The dates of the four documents, according to the archive website, are: 2/1/1997,
7/11/1997, 10/21/1997, and 2/6/1998. The Examiner notes in the 105 requirement that the
10/21/1997 document on the archive website describes a Suite of Tools including NameCheck,
NameClassifier, NameRegularizer, Intelligent Search Data Generator, and PhoneticNameKey
tools. The Examiner requests disclosure relating to these tools. 

During the investigation I reviewed printouts of all four of the www.archive.org
documents dated prior to March 25, 1998. During my review, I noticed that the oldest two
documents did not contain material describing the Suite of Tools mentioned by the Examiner.
That is, the Suite of Tools first appeared in the 10/21/1997 document, and does not appear in
either the 2/1/1997 document or the 7/11/1997 document. Accordingly, the Suite of Tools is not
in a document having a date one year prior to the priority date. Further, to the best of my 
knowledge, the Suite of Tools was not disclosed outside of LAS more than one year prior to 
March 25, 1998. 

I hereby declare that all statements made herein of my own knowledge are true and that 
all statements made on information and belief are believed to be true; and further that these 
statements were made with the knowledge that willful false statements and the like so made are 
punishable by fine or imprisonment, or both, under Section 1001 of Title 18 of the United States
Code and that such willful false statements may jeopardize the validity of the application or any 
patents issued thereon. 
DECLARATION OF JOHN CHRISTIAN HERMANSEN

This declaration relates to the following systems: Arabic Name Classifier ("ANC"),
Arabic Name Analyzer ("ANA"), Consular Lookout And Support System ("CLASS"), and
Distributed Name Check ("DNC").

To the best of Declarant's recollection, ANC was written as a design document and
delivered to a customer no later than the end of 1996.

ANC accepted a romanized input name and a COB associated with the input name, and
produced a binary result indicating whether the input name was considered to be Arabic.
Specifically, ANC determined a single surname for the input name, and compared that surname
against a list of surnames that were known both to be from the COB and to be Arabic. If there
was an exact spelling match, then ANC determined that the input name was Arabic and reported
this determination to a user. If there was not an exact match, then ANC (i) performed a digram
analysis on the input surname to determine the digrams present, (ii) produced an indicator of the
similarity between the digram analysis and digram results for Arabic surnames from the COB,
(iii) compared the value of the indicator to a threshold value representing confidence in the
similarity and, based on this comparison, produced a binary result indicating whether the input
name was considered to be Arabic, and (iv) reported the binary result to a user.
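
The two-stage decision described above (exact match first, digram similarity against a threshold second) can be illustrated with a minimal Python sketch. The declaration does not state how the similarity indicator was computed or what threshold was used, so the overlap measure, the function names, and the 0.6 threshold below are all assumptions made for illustration only.

```python
from collections import Counter


def digrams(name: str) -> Counter:
    """Count the adjacent two-letter sequences (digrams) in a name."""
    s = name.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def digram_similarity(surname: str, profile: Counter) -> float:
    """Assumed indicator: fraction of the surname's digrams seen in the profile."""
    grams = digrams(surname)
    total = sum(grams.values())
    if total == 0:
        return 0.0
    hits = sum(count for gram, count in grams.items() if gram in profile)
    return hits / total


def classify_arabic(surname, known_arabic_surnames, threshold=0.6):
    """Return True if the surname is considered Arabic.

    known_arabic_surnames stands in for the list of surnames known to be
    both Arabic and from the input name's COB. The threshold is hypothetical.
    """
    key = surname.lower()
    known = {s.lower() for s in known_arabic_surnames}
    if key in known:  # stage 1: exact spelling match
        return True
    profile = Counter()  # stage 2: digram results for the known surnames
    for s in known:
        profile.update(digrams(s))
    return digram_similarity(key, profile) >= threshold
```

For example, with the hypothetical list `["Hassan", "Hussein", "Rahman"]`, the spelling variant "Hasan" has no exact match but shares all of its digrams with the list, so it passes the second stage.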



ANA

To the best of Declarant's recollection, ANA was written as a design document and
delivered to a customer no later than the end of 1996. ANA accepted a romanized input name
known to be Arabic, and applied various modification rules to a single surname of the input
name. The rules were based on known spelling differences in

of the rules produced a resulting surname which could be different from
name. ANA then produced a key representing the resulting surname,
to pull names from a database.

CLASS 

To the best of Declarant's recollection, no later than the end of 1991 (i) according to the
terms of a contract with the United States government, and for compensation, Language Analysis
Systems provided a design to the United States government in the United States of America
proposing linguistics processing features for CLASS, (ii) according to the terms of a contract
between another party (not Language Analysis Systems) and the United States government, and
for compensation, CLASS was implemented in software by the other party, with the
implementation generally following the proposed design from Language Analysis Systems, and
CLASS was provided to the United States government in the United States of America, and (iii)
CLASS was operated on a mainframe in the United States of America and accessed by terminals
in one or more foreign countries.

CLASS accepted an input name and determined a rank-ordered list of names from a
database, where the names in the list were considered to be possible matches for the input name.
More specifically, CLASS:

(1) received the input name and various related or corresponding inputs including one or
more "compressed name" ("CN") key(s) for corresponding surname(s), a corresponding COB, a
corresponding date of birth ("DOB"), and possibly a corresponding state of birth,

(2) identified component elements of the input name (e.g., surname and given name), and
identified a first initial of the given name,

(3) identified digrams within each separate component element of the input name ("input
name digrams"),

(4) derived a set of names from within a database for comparison to the input name, the
set of names being derived based on the input name, the one or more CN keys, the DOB, and the
first initial of the given name,

(5) identified digrams for the component elements of the names in the set of names
("database name digrams"),

(6) selected a set of weighting rules for producing a score indicating

two names matched each other, the set of weighting rules being selected based on the COB of the
input name,

(7) compared the input name with each name in the set of names, the comparison
including comparing the input name digrams to the database name digrams,

(8) generated a metric for each name in the set of names by applying the set of weighting
rules during the comparison of the input name with each name in the set of names,

(9) rank-ordered all names in the set of names having a metric greater than a threshold
("score threshold")
from the set of names, and

(10) provided the rank-ordered names to the user.

The set of weighting rules assigned various points to a particular name in the set of names
based on a comparison of the particular name and the input name. For example, various points
might be assigned depending on whether (i) corresponding element(s) in the particular name and
the input name had similar digram results, (ii) the length of one or more elements was the same
in the particular name and the input name, (iii) the DOBs of the particular name and the input
name were within a predetermined timeframe of each other, (iv) the COB of the input name was
the same as the COB associated with the particular name, (v) the elements of the particular name
and the input name were in the same order, and (vi) the state of birth was the same for both the
particular name and the input name.
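
The scoring and ranking steps described above can be sketched in Python as follows. This is a simplified illustration, not the CLASS implementation: the point values, the two-year DOB window, the Dice-style digram overlap, the `Record` type, and the rule-set table keyed by COB are all hypothetical, and only four of the six listed factors are modeled.

```python
from dataclasses import dataclass


@dataclass
class Record:
    surname: str
    given: str
    dob_year: int
    cob: str


def digram_set(s):
    """Set of adjacent two-letter sequences in a name element."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def digram_overlap(a, b):
    """Assumed similarity measure: Dice coefficient over digram sets."""
    x, y = digram_set(a), digram_set(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


# Hypothetical weighting rules, selected by the COB of the input name.
RULES = {
    "default": {"digram": 60, "length": 10, "dob": 15, "cob": 15},
}


def score(query, cand, dob_window=2):
    """Assign points to a candidate name by comparing it with the input name."""
    w = RULES.get(query.cob, RULES["default"])
    pts = w["digram"] * digram_overlap(query.surname, cand.surname)
    if len(query.surname) == len(cand.surname):
        pts += w["length"]
    if abs(query.dob_year - cand.dob_year) <= dob_window:
        pts += w["dob"]
    if query.cob == cand.cob:
        pts += w["cob"]
    return pts


def rank_matches(query, candidates, threshold=50.0):
    """Keep candidates scoring above the threshold, highest score first."""
    scored = [(score(query, c), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] > threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept
```

In this sketch an identical record receives every available point, a close spelling variant with a nearby DOB ranks below it, and an unrelated name falls under the threshold and is dropped from the returned list.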

DNC

To the best of Declarant's recollection, no later than February 1997 (i) according to the
terms of a contract with the United States government and for compensation, DNC was
developed by Language Analysis Systems as a computer program and delivered by Language
Analysis Systems to the United States government in the United States of America, and (ii) DNC
was operated in one or more foreign countries. DNC was similar to CLASS, as described above,
except that (i) DNC did not receive or use a key for the surname(s) of the input name, (ii) DNC
derived the set of names based on the DOB, the COB, and the state of birth (if available), and
without reference to a key or the first initial of the given name, and (iii) DNC ran on a personal
computer and not on a mainframe, so that when operated in a foreign country DNC only ran on a
personal computer in the foreign country and not on a mainframe in the United States of
America.

Applicant : John Christian Hermansen et al.
Serial No. : 09/273,766
Filed : March 25, 1999
Attorney's Docket No. 16441-012001



I hereby declare that all statements made herein of my own knowledge are true and that
all statements made on information and belief are believed to be true; and further that these
statements were made with the knowledge that willful false statements and the like so made are
punishable by fine or imprisonment, or both, under Section 1001 of Title 18 of the United States
Code and that such willful false statements may jeopardize the validity of the application or any
patents issued thereon.
SOFTWARE DESIGN DESCRIPTION
AUTOMATIC NAME CLASSIFIER FOR 
CLASS-E (ANC-E) 

1. INTRODUCTION

1.1. Project Background

1.1.1. Legacy Consular Lookout And Support System (CLASS) and 
CLASS-E 

The Consular Lookout and Support System (CLASS) performs 
namechecks of visa and passport applicants in support of the issuance 
process. Used by United States passport agencies, consulates, and border 
inspection agencies, CLASS serves as an automated index to manual files.
CLASS is a centralized system residing on mainframe computers at the
Department of State in Washington, DC. The Bureau of Consular Affairs,
Consular Systems Division (CA/EX/CSD) of the Department of State 
(DOS) has responsibility for development, maintenance, and operation of 
CLASS. 

CLASS was implemented in 1989; since that time, major advancements 
have occurred in database management systems, large-scale computers and 
their operating systems, and data telecommunications. In addition, name- 
matching techniques have also evolved based on the DOS's experience 
with the system and further linguistic research. This has led DOS to
determine the need for a newer, more modernized system:
CLASS-E (Consular Lookout and Support System-Enhanced).

The CLASS-E modernized version of automated name-matching will
incorporate state-of-the-art hardware, data telecommunications, and
database management technology to migrate the CLASS application from
its Virtual Storage Access Method (VSAM) environment into a DB2
relational database system. In addition to providing virtually uninterrupted
access to the lookout databases 24 hours a day, 7 days a week to the VO,
PPT, overseas posts, and support users, this enhanced system will position
CLASS-E to incorporate advanced culturally-sensitive namecheck
methods.

1.1.2. Culturally Sensitive Name Searching in CLASS-E
Personal naming systems vary widely from culture to culture. That is,
names from around the world do not necessarily fit cleanly into the
Anglophone name model. Several of the manifestations of these
differences are:

• Anglicization of non-English sound patterns (Mladevic written as
Miladevich)

• Variant romanization schemes (Arabic Waseem - Ouassime, Shareef -
Cherife; Chinese Xia - Hsia - Sya)

• Dialectal variants (Arabic Abu Bakir [Egyptian] - Boubker [Moroccan];
Chinese Wu [Mandarin] - Ng [Cantonese, Fukien])

• Variant roman spelling conventions (French silent letters, German sch for
English sh)

When dealing with Arabic and Chinese names and those of other
languages that do not use the Roman alphabet, for example, one quickly
discovers one major source of name variation lies in how names are
transliterated into roman characters from the original scripts. For both
Arabic and Chinese, there are numerous competing transliteration
standards, as well as less formal traditions. Xia, Hsia, and Sya, for
example, are all romanized variants of the same Chinese name. Kassim,
Qasim, Casem, Kacem and Asim are romanized variants of the same
Arabic name. In Arabic, name variation often goes beyond the phonetic
level. Analyzable elements such as "Abu" show up in many different
forms, depending on dialect (e.g., Abu Bakir - Boubker). In Chinese,
multiple traditions of transliteration are one of the sources of name
variation; dialect issues also abound (e.g., Wu - Ng). Hispanic names,
which make up the largest portion of the data base, place information 
value on name parts in a manner that is not consistent with Anglophone 
naming conventions. Exploitation of this culturally-specific information 
in the name search process leads to improved precision, recall, and overall 
system performance. 

1.1.3. Automatic Name Classifier-E (ANC-E) in CLASS-E 
Automatic name classification has become an
undisputed necessary first step in the process of applying linguistic
knowledge to solve the problems associated with name searching in large
multicultural databases. In this environment, name classification serves as
a means of routing queries to the proper language- and culture-specific
algorithms. Currently, Legacy CLASS supports a single module, called
ANI, which begins to address this need by returning a Boolean value
indicating whether a name is or is not Arabic. If a name qualifies as Arabic,
it is subject to processing by an initial implementation of the Arabic
algorithm designed by LAS for the State Department. Currently the
expanding needs of the State Department are being addressed in the
development of a second culture-specific algorithm which will handle
Hispanic names. The addition of a Hispanic algorithm to CLASS's
functionality requires the addition of a method for identifying Hispanic
names in a manner similar to that of ANI.

At this juncture, it is reasonable to turn enhancement efforts towards the 
development of a single, integrated, expandable algorithm for name 
classification which will address the need for classifying Arabic and 
Hispanic names, and which will anticipate the imminent addition of other 
languages. The integrated automatic name classification algorithm will 
represent a significant improvement over the existing ANI algorithm in 
that it will incorporate more linguistic knowledge, it will allow for future 
expansion with minimal coding effort, and it will allow information about 
a record's country of birth (COB) to contribute to the query routing
decision. Figure 1-1 displays the integration of ANC-E within the 
CLASS-E system. 



ANC-E in CLASS-E
Figure 1-1

1.2. Scope



This document describes the linguistic motivation, requirements, and high level 
design for an Automatic Name Classifier (ANC) which will automatically 
determine whether a name qualifies as Hispanic or Arabic. The document's 
purpose is to provide information about the proposed design in order to facilitate 
the analysis and planning necessary to prepare for eventual implementation.

Intended to serve as the module that will provide for the integration of the 
enhanced Arabic Name Search Algorithm for CLASS-E (ANA-E) and the 
Hispanic Name Search Algorithm (HNA-E) into the overall CLASS-E 
architecture, the Automatic Name Classifier for CLASS-E (ANC-E) will provide 
the capability to automatically determine whether an input name is Arabic,
Hispanic, or neither. In this system, names may be qualified as Arabic or
Hispanic by virtue of passing one of two thresholds, or, conversely, may be
disqualified as Arabic or Hispanic by virtue of having many characteristics of
'Other' types of names. The ANC-E system has been designed with an open
architecture intended to facilitate the inclusion of additional cultures in the event 
that CLASS-E adds other culture-specific search algorithms in the future. 
Furthermore, since the ANC-E is data-driven, it is possible to tune its level of 
sensitivity for each individual culture being identified. 

In CLASS-E the concept of the Legacy CLASS Multi-Pipe Architecture will be
carried forward to include a distinct Arabic processing algorithm and a distinct
Hispanic processing algorithm as well as perhaps others in the future. The type of
processing to which an input name will be submitted will be a business decision
of CA/EX/CSD and may to some degree be dependent on the impact that multiple
processing of an input name would have on the performance of the system. It is
likely that input names that are classified by the Automatic Name Classifier for
CLASS-E (ANC-E) will be submitted to more than one of the following processors: the
generic CLASS-E processing algorithm, the DOB processing algorithm,
the ANA-E algorithm, and the HNA-E algorithm. The ANC-E will provide a
determination as to which culture or cultures a name belongs; what use is made of
this determination is a business decision of CA/EX/CSD. This decision will
affect the design of the interface between the ANC-E and the rest of the CLASS-E 
system. 

1.3. Definitions and Acronyms
1.3.1. Definitions 

Affix*       A name particle which is neither a title nor a qualifier. Affixes in the
             ANC-E are defined as being delineated by white space; for example,
             'de' in 'Tirso de Molina'. Note that, contrary to normal usage within
             linguistics, affixes are in contrast to (bound) morphemes, which are
             not delineated by white space.
Digraph      A two character n-gram.
Field        A data entry mechanism which allows the user to input a fixed
             number of characters. The fields typically referred to in the CLASS
             environment are the Given Name Field and the Surname Field.
             Note that it is important to distinguish between given name and
             surname data entry Fields and given name and surname data
             elements, since data elements do not always occur in the proper field.
Given Name   The portion of a name which uniquely identifies an individual
             member of a family, as opposed to surname. Given Names may
             include one or more segments; for example, 'Mary Jane' in 'Mary
             Jane Cassoway'.
Infix        A substring occurring in the middle of a name segment, but not at the
             edges. Both n-grams and morphemes may be infixes.
Morpheme*    (here, bound morpheme) A meaningful, variable length substring of a
             name segment. Morphemes may occur as prefixes, infixes or suffixes.
             Example: '-ovitch' in 'Berkovitch'. Note that morphemes contrast
             with affixes.
Morphology   Referring to morphemes.
Name         The general term referring to the entire collection of segments which
             refer to a single person. A name may include one or more given
             names, one or more surnames and zero or more particles. For the
             purposes of ANC-E, a Name is considered to consist only of
             alphabetic characters and white space. The diagram below illustrates
             the relation of name parts to one another:

             [Diagram: tree relating Name to its parts (Surname, Given Name,
             Compound, Name1, Name2, Complex, Simple, Prefix, Stem), illustrated
             with the example name 'Mohammad Abdel Rahmen Jawad']

N-Gram       A variable length sequence of characters which serves as a useful
             indicator of linguistic affinity, but which is not associated with a
             meaning. N-Grams may be considered to be indicators of the sound
             or spelling patterns of a language; for example, -ez is a Hispanic
             N-Gram.
Particle     A functional name element delineated by white space. Titles, affixes
             and qualifiers are the three kinds of particles identified in the ANC-E
             algorithm.
Prefix       A substring (N-Gram or morpheme) or a particle (affix) occurring at
             the beginning of a name segment.
Qualifier    A meaningful particle which represents a kinship relation or earned
             social status; for example, Jr. or Ph.D. Qualifiers typically occur at
             the end of a name field.
Segment      Any element within a name which is delineated by white space.
Suffix       A substring (N-Gram or morpheme) or a particle (affix) occurring at
             the end of a name segment.
Surname      The portion of a name which may indicate family membership, as
             opposed to given name. Surnames may include one or more segments
             and zero or more particles; for example, 'Fernandez de la Puente' in
             'Hector Fernandez de la Puente'.
Syntax       The rules governing the order of name elements.
Title        A meaningful particle which represents a term of address and which
             typically occurs at the beginning of a name field. Examples: Dr. or
             Sir. Titles may be indicative of social position.
Trigraph     A three character n-gram.
Variant      An alternate spelling of a name segment; for example, Mohammad
             and Muhamed are variants of one another. Variants may be
             predictable, as in this example, or unpredictable, as evidenced by
             typographical or other data entry errors.

* Note that these terms have a slightly modified or restricted definition within the context of ANC-E.




1.3.2. Acronyms

ANA          Legacy Arabic Namecheck Algorithm
ANA-E        Arabic Namecheck Algorithm for CLASS-E
ANC-E        Automatic Name Classifier for CLASS-E
ANI          Arabic Name Identification (of Legacy CLASS ANA)
ANR          Arabic Name Regularization (of Legacy CLASS ANA)
AOR          Application Owning Region
ARTP         Acceptance/Regression Test Plan
ARTR         Acceptance/Regression Test Report
BIMC         Beltsville Information Management Center
C/CE         CLASS to CLASS-E
CA           Bureau of Consular Affairs
CA/EX/CSD    Consular Affairs, Consular Systems Division
CAX          Consular Affairs Experimental (Development)
CCB          Configuration Control Board
CCR          Configuration Change Request
CDD          Critical Design Document
CDR          Critical Design Review
CE           CLASS-Enhanced
CICS         Customer Information Control System
CLASS-E      Consular Lookout and Support System-Enhanced
CM           Configuration Management
CMOS         Complementary Metal Oxide Semiconductor
COB          Country of Birth
COR          Contracting Office Representative
CSD          Computer Systems Division
DBMS         Database Management System
DB2          IBM's relational database
DIA          Digraph Information Aggregator (of ANC-E)
DNC          Distributed Namecheck
DOB          Date of Birth
DOS          Department of State
FRR          Functional Requirements Review
FRS          Functional Requirements Specification
HNA-E        Hispanic Namecheck Algorithm for CLASS-E
IBIS         Interagency Border Inspection System
IDP1/IDP2    Intermediate Decision Processor 1 / 2 (of ANC-E)
IP           Installation Plan
IV&V         Independent Verification and Validation
LIA          Linguistic Information Aggregator (of ANC-E)
LID          Linguistically Informed Decision Processor (of ANC-E)
LQA          Linguistic Quality Assurance
LQAR         Linguistic Quality Assurance Report
LSP          Linguistic Support Plan
LTF          Linguistic Trace Facility
NC           Namecheck
PC           Production Control
PMP          Project Management Plan
PPP          Post Phase-In Plan
PPT          Passport Office
PTS          Parallel Transaction Server
QA           Quality Assurance
QMF          Query Management Facility
QRP          Query Routing Processor
SA-1         State Annex-1
             Software Engineering Standards and Procedures
TAQ          Title, Affix, Qualifier
TIR          Test Incident Report
TOR          Terminal Owning Region
TRR          Test Readiness Review
VO           Visa Office
VSAM         Virtual Storage Access Method

2. References

2.1. CLASS-E Project Management Plan (PMP)

2.2. CLASS-E Functional Requirements Specification (FRS)
2.2.1. Note: the CLASS-E FRS has not yet been finalized.

3. Decomposition Description
3.1. Module Decomposition

ANC-E Module Decomposition

[Diagram: module boxes labeled Linguistically Informed Decision Processor; Final Decision
Processor; Linguistic Information Aggregator; Intermediate Decision Processor 1 (LID Decision);
Digraph Information Aggregator; Intermediate Decision Processor 2 (Digraph Decision);
HF Name Processor; TAQ Processor; Ngram Processor]

Figure 3-1



3.1.1. Automatic Name Classifier for CLASS-E (ANC-E) Module
Decomposition

3.1.1.1. Identification

This program is referred to as the Automatic Name
Classifier for CLASS-E (ANC-E).

3.1.1.2. Type

ANC-E is a program that is part of the larger CLASS-E
system. It can be viewed as a "shell" program in that it is to
serve as a layer surrounding all of the culturally-specific
name search algorithms implemented in CLASS-E.

ANC-E    Language Analysis Systems, Inc.    03/19/98

3.1.1.3. Purpose 

3.1.1.3.1. Automatic name classification is a
necessary first step in the process of applying
linguistic knowledge to solve the problems
associated with name searching in large
multicultural databases.

3.1.1.3.2. In the CLASS-E environment, name
classification serves as a means of routing
queries to the proper language- and culture-
specific algorithms.

3.1.1.3.3. In addition to the rudimentary identification of
Arabic names currently implemented in ANI,
the addition of a Hispanic name search
algorithm to CLASS-E's functionality requires
the addition of a method for identifying
Hispanic names.

3.1.1.3.4. ANC-E is a single, integrated algorithm for
name classification which will address the need
for classifying Arabic and Hispanic names, and
which will anticipate the possible addition of
other languages.

3.1.1.3.5. This integrated automatic name classification 
algorithm will represent a significant 
improvement over the existing ANI algorithm in 
that it will incorporate more linguistic 
knowledge, and will allow information about a 
record's country of birth (COB) to contribute to 
the query routing decision. 

3.1.1.4. Function 

3.1.1.4.1. The ANC-E will take as input a surname, given
name, and COB in standard CLASS-E format.

3.1.1.4.1.1. There are two options with respect to
the methodology for handling an input
name and gathering the aggregate data
that will lead to the determination of
cultural affinity for that name.
3.1.1.4.1.1.1. If ANC-E is to be
implemented in an object-
oriented environment, an
object can be created which
will contain all of the
accumulated information to be
used in the determination of
cultural affinity. This object
travels through the ANC-E
system, thus allowing access
to the accumulated
information at any time. If
ANC-E is integrated with the
culturally-sensitive name
search algorithms in
CLASS-E, this option has the
advantage that all of the
attendant linguistic
information produced by
ANC-E processing could be
passed, along with the name,
to the culturally-sensitive
namecheck algorithm for
further processing. That is,
certain common linguistic
processing would need to be
performed only one time for
the entire namecheck process,
rather than once for each
specific name search
algorithm invoked.

3.1.1.4.1.1.2. If ANC-E is to be
implemented in a non-object-
oriented environment, ANC-E
will process the name and
COB as separate string values,
and will output either a
single cultural affinity
indicator (e.g. Arabic,
Hispanic, or Other) or three
Boolean values, one for each
culture under consideration,
depending on the business
decision made by
CA/EX/CSD. If this option is
chosen, linguistic processing
information and scoring
internal to ANC-E will not be
available to outside processes.
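
The difference between the two options can be illustrated with a short Python sketch. Everything below is hypothetical: the `NameAnalysis` class, the function names, and in particular the toy `score_name` rule (an '-ez' ending counts as Hispanic evidence and an 'abd' beginning as Arabic evidence) merely stand in for the real linguistic processing, which this document specifies elsewhere.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

CULTURES = ("Arabic", "Hispanic", "Other")


def score_name(surname: str, given_name: str, cob: str) -> Dict[str, float]:
    """Toy stand-in for ANC-E scoring; not the real algorithm."""
    s = surname.lower()
    scores = {c: 0.0 for c in CULTURES}
    if s.endswith("ez"):
        scores["Hispanic"] = 1.0
    elif s.startswith("abd"):
        scores["Arabic"] = 1.0
    else:
        scores["Other"] = 1.0
    return scores


@dataclass
class NameAnalysis:
    """Option 1: an object carrying the name and all accumulated information."""
    surname: str
    given_name: str
    cob: str
    scores: Dict[str, float] = field(default_factory=dict)

    def affinity(self) -> str:
        return max(CULTURES, key=lambda c: self.scores.get(c, 0.0))


def analyze(surname: str, given_name: str, cob: str) -> NameAnalysis:
    """Option 1 entry point: downstream algorithms can reuse the scores."""
    return NameAnalysis(surname, given_name, cob,
                        score_name(surname, given_name, cob))


def classify(surname: str, given_name: str, cob: str) -> str:
    """Option 2, first variant: return only a single affinity indicator."""
    return analyze(surname, given_name, cob).affinity()


def classify_flags(surname: str, given_name: str, cob: str,
                   threshold: float = 0.5) -> Tuple[bool, bool, bool]:
    """Option 2, second variant: one Boolean per culture, in CULTURES order."""
    scores = score_name(surname, given_name, cob)
    return tuple(scores[c] >= threshold for c in CULTURES)
```

With option 1, a caller receives the `scores` dictionary along with the name and can pass it on to a culture-specific search; with option 2, only the indicator or the three Booleans leave the classifier, which mirrors the trade-off stated above.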

3.1.1.4.2. The ANC-E will provide a determination as to
which culture or cultures a name belongs. 

3.1.1.4.3. The use that is made of the cultural affinity
determinations made by ANC-E is a business 
decision of CA/EX/CSD (i.e. whether to allow a 
name to be processed by more than one 
namecheck algorithm, and whether ANC-E shall 
return more than one possible cultural affinity 
for a given input name). This decision will 
affect the design of the interface between the 
ANC-E and the rest of the CLASS-E system. 

3.1.1.5. Subordinates 

The following processes are subordinate to the main 
ANC-E program: 

• The Linguistically Informed Decision Processor (LID) 

• The Digraph Distribution Processor 

• The Final Decision Processor. 



3.1.2. Linguistically Informed Decision (LID) Module Decomposition 

3.1.2.1. Identification 

This module is referred to as the Linguistically Informed 
Decision Processor (LID). 

3.1.2.2. Type 

The LID is a module which contains two subordinate 
modules. The first subordinate module performs linguistic 
analysis, gathering linguistic information and scoring for 
the input name. The second subordinate module makes 
decisions as to the cultural affinity of the name, based on 
the scoring information gathered by the first module. 
3.1.2.3. Purpose

3.1.2.3.1. The LID exists to provide a linguistically well-
founded decision as to the cultural affinity of the
input name.

3.1.2.3.2. As the first phase of processing, the LID
addresses performance requirements by basing
this decision on multiple readily observable
linguistic factors, thus obviating the need for
processing by the more intensive statistical
digraph model and for reliance on name-external
factors, such as Country of Birth (COB).

3.1.2.3.3. Furthermore, the LID provides a more
linguistically-rich context in which to determine
the cultural affinity of the input name than does
its purely digraph-distribution-based
predecessor, ANI. Thus ANC-E is better able to
identify names that are Hispanic or Arabic and
to eliminate those that are not. Linguistic
Indicators provide a rich source of information
about the cultural affinity of a name. The LID
processor will serve as a means of assuring that
names which are strongly Arabic or Hispanic are
qualified and, conversely, that names which
have strong characteristics of some other culture
are disqualified. Names which qualify as
Hispanic, Arabic or 'Other' will not be
submitted to the Digraph Analysis function.

3.1.2.4. Function



3.1.2.4.1. All linguistic indicator processing will take place 
before digraph analysis and will constitute a
linguistically informed decision (LID) 
mechanism. 

3.1.2.4.2. The LID accumulates and weighs factors from
multiple knowledge sources in order to determine
whether there is a sufficient amount of evidence
to identify the input name as being Hispanic or
Arabic, or, conversely, if there is enough
evidence to discount the possibility that the input
name is either Hispanic or Arabic.

3.1.2.4.3. The LID will assign points to a name based on a
weighted tabulation of scores from the
following data sources:

• High Frequency name data

• TAQ data

• Morphological data

• Ngram data

3.1.2.4.4. The function of the LID is to determine a score
for each cultural affinity being classified, and a
score for 'Other'. For each culture, a name
must get a score which passes its corresponding
LID Threshold in order to be labeled as Arabic,
Hispanic or "Other".

3.1.2.4.5. Each of the four types of linguistic indicator
(listed in 3.1.2.4.3) will be associated with a set
of four parameters, indicating the weight that a
LID element is to be given.

3.1.2.4.6. The score for each language group will be
calculated as the sum, over every indicator found
in the name string, of the applicable factor times
the score for that indicator. Scoring
details are included in the decomposition
descriptions of the respective modules. (See
sections 3.1.3 - 3.1.8.)

3.1.2.4.7. After all of the agents have processed the input
name, the LID combines the detailed scoring
information returned by the LIA to produce a
LID score for Hispanic, Arabic, and for Other.

3.1.2.4.8. The LID passes the LID score to the Intermediate
Decision Processor 1 for comparison to LID
thresholds for cultures under consideration.

3.1.2.4.9. There are two alternatives for the output of the
processing of the input name performed by the
LID: an object containing linguistic processing
information and scores, or three Boolean values
indicating whether the name has passed the LID
thresholds for Arabic, Hispanic, or Other. For
more information, see 3.1.1.4.1.1.

3.1.2.4.10. If the LID identifies a name as Hispanic,
Arabic, or Other (or any combination thereof),
no further processing is required.

3.1.2.4.11. For a detailed example of LID processing, see
the figures in Appendix A. 
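
The weighted tabulation described in 3.1.2.4.3 through 3.1.2.4.7 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the ANC-E implementation: the Indicator record, the lid_scores function, and every numeric value are hypothetical, and each indicator is assumed to arrive already paired with its applicable weight.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Indicator:
    """One linguistic indicator found in the input name (hypothetical record)."""
    agent: str      # "HFNAME", "TAQ", "MORPHOLOGY" or "NGRAM"
    culture: str    # "Arabic", "Hispanic" or "Other"
    raw_score: int  # score stored with the indicator in its data store
    weight: int     # applicable factor, assumed to come from the LID Parameter data store


def lid_scores(indicators):
    """Sum weight * raw_score per culture, following 3.1.2.4.6."""
    totals = defaultdict(int)
    for ind in indicators:
        totals[ind.culture] += ind.weight * ind.raw_score
    return dict(totals)


# Invented values, for illustration only.
found = [
    Indicator("HFNAME", "Arabic", raw_score=5, weight=10),
    Indicator("TAQ", "Arabic", raw_score=3, weight=8),
    Indicator("NGRAM", "Hispanic", raw_score=2, weight=4),
]
scores = lid_scores(found)
```

The resulting per-culture totals are what would then be handed to Intermediate Decision Processor 1 for comparison against the LID thresholds.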

3.1.2.5. Subordinates 

The following processes are subordinate to the LID: 

• The Linguistic Information Aggregator (LIA) 

• Intermediate Decision Processor 1 (LID Decision). 

3.1.3. Linguistic Information Aggregator (LIA) Module Decomposition

3.1.3.1. Identification 

This module is referred to as the Linguistic Information 
Aggregator (LIA). 

3.1.3.2. Type 

LIA is a module which contains four subordinate functions 
(agents), all of which contribute to the final decision or
decisions made by the LID as to the cultural affinity of the 
input name. Thus, conceptually, LIA and the LID can be 
viewed as parts of a blackboard (voting) system, an expert 
system, or as parts of a system with multiple intelligent 
agents. 

3.1.3.3. Purpose 

3.1.3.3.1. The LIA exists to enable the linguistic decision 
made by the LID. The LIA controls the flow of 
information from the four linguistic agents 
subordinate to it. 



3.1.3.3.2. If the implementation choices accompanying the
object-oriented description of ANC-E are
chosen (see 3.1.1.4.1.1.1), LIA could help
performance by allowing certain linguistic
processing to occur only once for each name
check, rather than once for each algorithm
invoked. (Note: In Legacy CLASS each
algorithm is referred to as a separate 'pipe'.)

3.1.3.4. Function



3.1.3.4.1. LIA accumulates linguistic information factors 
from multiple knowledge sources for each 
culture under consideration (i.e. currently 
Hispanic, Arabic, and Other). 

3.1.3.4.2. In cases of conflict, the order of precedence for
identifying items within an input name is
TAQ particle, Morpheme, Ngram.



3.1.3.4.2.1. If a string of letters is identified as a 
TAQ particle for a particular culture, a 
substring of that same string (including 
the entire string itself) cannot also be 
identified as a Morpheme or an Ngram 
for that same culture. 

3.1.3.4.2.2. If a string is identified as a Morpheme 
for a particular culture, the characters 
that make up that Morpheme cannot 
also be considered as part of an Ngram 
for that culture. 



3.1.3.4.2.3. HF Names from a given culture can
contain Morphemes and/or Ngrams
for that same culture; however, the
precedence rules in sections 3.1.3.4.2.1
and 3.1.3.4.2.2 apply.

3.1.3.4.3. As the subordinate functions (agents) process the 
input name, detailed scoring information is 
collected by LIA, and weighted according to its 
information value as indicated in the LID 
Parameter data store. 
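
The precedence rules of 3.1.3.4.2 might be applied along the following lines. This is a sketch under assumptions: the Match record, the character-span representation, and the apply_precedence function are hypothetical, and "substring of" and "part of" are both approximated as overlapping character spans within the same culture.

```python
from dataclasses import dataclass

# Lower value means higher precedence (3.1.3.4.2).
PRECEDENCE = {"TAQ": 0, "MORPHEME": 1, "NGRAM": 2}


@dataclass(frozen=True)
class Match:
    kind: str     # "TAQ", "MORPHEME" or "NGRAM"
    culture: str
    start: int    # index of the first character in the input name
    end: int      # index one past the last character


def overlaps(a, b):
    """True when the two character spans share at least one position."""
    return a.start < b.end and b.start < a.end


def apply_precedence(matches):
    """Keep a match only if no higher-precedence match for the same culture overlaps it."""
    kept = []
    for m in sorted(matches, key=lambda m: PRECEDENCE[m.kind]):
        blocked = any(
            k.culture == m.culture
            and PRECEDENCE[k.kind] < PRECEDENCE[m.kind]
            and overlaps(k, m)
            for k in kept
        )
        if not blocked:
            kept.append(m)
    return kept


# Abstract spans, invented for illustration.
matches = [
    Match("NGRAM", "Arabic", 0, 2),      # inside the Arabic TAQ particle: dropped
    Match("TAQ", "Arabic", 0, 3),
    Match("NGRAM", "Hispanic", 0, 2),    # different culture: kept
    Match("MORPHEME", "Arabic", 5, 8),
    Match("NGRAM", "Arabic", 6, 9),      # shares characters with the Morpheme: dropped
]
kept = apply_precedence(matches)
```

Because the rules are stated per culture, the Hispanic Ngram in the example survives even though it covers the same characters as the Arabic TAQ particle.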



3.1.3.4.4. After all of the agents have provided their input,
the LIA returns this detailed scoring information
to the LID.



3.1.3.4.5. For a detailed example of aggregation of
information by LIA, see the figures in
Appendix A.

3.1.3.5. Subordinates

The following processes are subordinate to the LIA: 

• The High Frequency (HF) Name Processor 

• The Title, Affix, Qualifier (TAQ) Processor

• The Morphological Processor 

• The Ngram Processor. 

3.1.4. High Frequency (HF) Name Processor Module Decomposition 

3.1.4.1. Identification 

This function is referred to as HF Name Processor. 

3.1.4.2. Type 

The HF Name Processor is a function which is invoked by 
the Linguistic Information Aggregator (LIA). 

3.1.4.3. Purpose 

Certain given names and surnames occur much more
frequently in some cultures than in others. The name
"Mohammed", for example, occurs frequently in Arabic
names. The surname "Rodriguez" lends support to the
possibility that the name in question is Hispanic. The name
"Nganga" in any position suggests that the name might not
be either Arabic or Hispanic. The HF Name Processor
exists to take advantage of the information available in high
frequency names in the cultural identification of the name.

3.1.4.4. Function 



3.1.4.4.1. For each name segment present in the input
name, the HF Name Processor determines
whether that name is present in the HF Name
data store.

3.1.4.4.2. If the name is present in the HF Name data store,
the HF Name Processor retrieves and records
the culture, name field (given name or surname),
and score associated with that name from the
data store.

3.1.4.4.3. Also recorded for each HF name found is
whether it was found in position or out of
position. For example, since "Rodriguez" is
listed as a surname in the HF Name data store,
if it is found in the GN field in the input name, it
is reported as a surname considered to be out of
position.

3.1.4.4.4. The HF Name Processor tracks scoring
information for each HF name found, and returns
this detailed scoring information to LIA. 
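
As an illustration of 3.1.4.4.1 through 3.1.4.4.3, the lookup and the in-position test could look like the Python sketch below. The HFEntry record, the HF_NAMES dictionary, the hf_lookup function, the field codes "G" and "S" as used here, and all scores are assumptions made for the example; the field assigned to "Mohammed" is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HFEntry:
    culture: str
    name_field: str  # "G" = given name, "S" = surname
    score: int


# Hypothetical contents of the HF Name data store; scores are invented.
HF_NAMES = {
    "RODRIGUEZ": HFEntry("Hispanic", "S", 9),
    "MOHAMMED": HFEntry("Arabic", "G", 9),
}


def hf_lookup(segments, field):
    """Return (segment, entry, in_position) for each segment found in HF_NAMES.

    `field` is the input field the segments were taken from: "G" or "S".
    """
    hits = []
    for seg in segments:
        entry = HF_NAMES.get(seg.upper())
        if entry is not None:
            # In position only when the stored field matches the input field.
            hits.append((seg, entry, entry.name_field == field))
    return hits


# "Rodriguez" is stored as a surname but appears here in the given name field.
hits = hf_lookup(["Rodriguez"], field="G")
```

The third element of each hit is the in-position flag that the LID later uses to choose between in-field and out-of-field weightings.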

3.1.4.5. Subordinates
None. 

3.1.5. Title, Affix, Qualifier (TAQ) Processor Module Decomposition

3.1.5.1. Identification 

This function is referred to as the TAQ Processor. 

3.1.5.2. Type 

The TAQ Processor is a function which is invoked by the 
Linguistic Information Aggregator (LIA). 

3.1.5.3. Purpose 

As noted in section 1.3.1, name fields have a syntactic 
structure which may be simple, compound, complex, or 
compound-complex. Name fields which are complex or 
compound-complex contain particles: titles, affixes, or 
qualifiers. These particles can be used to further narrow the 
range of possibilities for the cultural affinity of the input
name. The TAQ Processor exists to make use of the
information available in particles.

3.1.5.4. Function 

3.1.5.4.1. For each segment present in the input name, the
TAQ Processor determines whether that
segment is a particle present in the TAQ data
store.

3.1.5.4.2. If the segment is present in the TAQ data store,
the TAQ Processor retrieves and records the
culture, name field (given name or surname),
and score associated with that TAQ particle
from the data store.

3.1.5.4.3. Also recorded for each TAQ particle found is
whether it was found in position or out of
position. (See example in section 3.1.4.4.3.)

3.1.5.4.4. The TAQ Processor tracks scoring information 
for each TAQ particle found, and returns this
detailed scoring information to LIA. 

3.1.5.5. Subordinates
None.



3.1.6. Morphological Processor Module Decomposition

3.1.6.1. Identification 

This function is referred to as Morphological Processor. 

3.1.6.2. Type 

The Morphological Processor is a function which is 
invoked by the Linguistic Information Aggregator (LIA).

3.1.6.3. Purpose 

As noted and defined in section 1.3.1, morphological 
elements, such as -ovich, can play a large part in 
determining the cultural affinity of an input name. The 
Morphological Processor exists to take advantage of this 
information in the name classification process. 

3.1.6.4. Function 

3.1.6.4.1. For each Morpheme present in the Morphology
data store, the Morphological Processor
determines whether that Morpheme is present in
the input name.

3.1.6.4.1.1. Note that the above processing differs
from that in the HF Name Processor
(3.1.4.4) and the TAQ Processor
(3.1.5.4). Since the Morphology data
store contains only bound Morphemes,
that is, Morphemes not surrounded by
white space, it is not possible to locate
them based on name segments, which
are surrounded by white space.
Rather, it is necessary to determine if
any of the items listed in the
Morphology data store is a substring of
any of the name segments present in
the input name, within certain
constraints. For more detailed
information on identifying Morphemes
in the input name, see sections 3.2.4
(Morphological Data Store Data
Decomposition) and 3.1.6
(Morphological Processor Module
Decomposition).

3.1.6.4.2. For each Morpheme found in the input name,
the Morphological Processor retrieves and 
records the morpheme found, the culture, name 
field (given name or surname), and score 
associated with that Morpheme from the data 
store. 

3.1.6.4.3. Also recorded for each Morpheme found is 
whether it was found in position or out of 
position. (See example in section 3.1.4.4.3.) 

3.1.6.4.4. The Morphological Processor tracks scoring 
information for each Morpheme found, and 
returns this detailed scoring information to LIA. 
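
The substring-based search described in 3.1.6.4.1.1 can be sketched as below. The sketch makes several assumptions: the find_morphemes function and the layout of the store are hypothetical, the culture and score attached to "OVICH" are invented, and the "certain constraints" mentioned above (defined in section 3.2.4) are not modelled.

```python
def find_morphemes(segments, morphology_store):
    """For each stored morpheme, report every name segment that contains it.

    `morphology_store` maps an upper-case morpheme to a (culture, score) pair.
    Returns tuples of (morpheme, segment, position, culture, score).
    """
    hits = []
    for morpheme, (culture, score) in morphology_store.items():
        for seg in segments:
            # Bound morphemes are not delimited by white space, so a plain
            # substring search within each segment is used.
            pos = seg.upper().find(morpheme)
            if pos != -1:
                hits.append((morpheme, seg, pos, culture, score))
    return hits


# Invented store entry and input name, for illustration only.
STORE = {"OVICH": ("Other", 7)}
hits = find_morphemes(["Ivanovich"], STORE)
```

Note that the outer loop runs over the data store rather than over the name segments, mirroring the wording of 3.1.6.4.1. A comparable loop would serve the Ngram Processor, which the document describes as similar.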

3.1.6.5. Subordinates
None. 

3.1.7. Ngram Processor Module Decomposition 

3.1.7.1. Identification 

This function is referred to as the Ngram Processor.

3.1.7.2. Type 

The Ngram Processor is a function which is invoked by the 
Linguistic Information Aggregator (LIA). 

3.1.7.3. Purpose 

As described in section 1.3.1, Ngrams are strings of letters
that occur with statistical significance in names with a
given cultural affinity. The Ngram Processor exists to take
advantage of this statistical phenomenon in the name typing
process.



ANC-E, Language Analysis Systems, Inc., 03/19/98



3.1.7.4. Function 

3.1.7.4.1. For each Ngram present in the Ngram data store,
the Ngram Processor determines whether that
Ngram is present in the input name.

3.1.7.4.1.1. Note that the above processing is
similar to that in the Morphological
Processor. (See section 3.1.6.4, and
especially section 3.1.6.4.1.1 for a
detailed note.)

3.1.7.4.2. For each Ngram found in the input name, the
Ngram Processor retrieves and records the
Ngram found, the culture, name field (given
name or surname), and score associated with
that Ngram from the data store.

3.1.7.4.3. Also recorded for each Ngram found is whether
it was found in position or out of position. (See
example in section 3.1.4.4.3.)

3.1.7.4.4. The Ngram Processor tracks scoring information 
for each Ngram found, and returns this detailed 
scoring information to LIA. 



3.1.7.5. Subordinates 
None. 

3.1.8. Intermediate Decision Processor 1 (LID Decision) Module 
Decomposition 

3.1.8.1. Identification 

This module is referred to as Intermediate Decision 
Processor 1 (IDP1).

3.1.8.2. Type

IDP1 is a function which is invoked directly by the
Linguistically Informed Decision Processor (LID).

3.1.8.3. Purpose

IDP1 is the decision-making function of the LID. It
determines whether enough linguistic information has been
gathered from the various intelligent agents by LIA to
confidently determine that the input name belongs to one of
the cultures being identified (currently Arabic, Hispanic,
and Other).



3.1.8.4. Function 

3.1.8.4.1. IDP1 accepts as input one aggregate LID score
for each culture being identified as well as an
aggregate LID score for Other.

3.1.8.4.2. For each LID score, IDP1 compares that score to
the LID threshold for the appropriate culture (or
Other).

3.1.8.4.3. If the LID score is greater than or equal to the
appropriate LID threshold, IDP1 returns a value
of True for the culture in question. If the LID
score is less than the LID threshold for the
culture in question, IDP1 returns a value of
False for the culture in question.

3.1.8.4.3.1. A True value indicates to the LID that
enough evidence has been accumulated
by LIA to confidently identify the name
as belonging to the culture in question.

3.1.8.4.3.2. A False value indicates to the LID that
not enough evidence has been
accumulated by LIA to confidently
identify the name as belonging to the
culture in question.

3.1.8.4.3.3. A value of True can be returned for
more than one cultural affinity.

3.1.8.4.3.4. A value of False may be returned for all
cultural affinities.

3.1.8.4.4. Alternatively, IDP1 could return a value for
each culture equal to the LID score minus the
LID threshold for that culture.

3.1.8.4.4.1. Given the alternative above, the LID
would interpret negative scores as
False values and nonnegative scores as
True values.

3.1.8.4.4.2. The utility of this alternative is that if
an object-oriented implementation is
chosen, the values calculated by IDP1
could be incorporated into the object
mentioned in section 3.1.1.4.1.1.1, and
would be available as part of the
information that the name object
"knows" about itself for use in later
processing.

3.1.8.4.5. If a return value of True for any culture (or for
"Other") is obtained from IDP1, no further
processing is required. 
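
Both output alternatives of IDP1 (Boolean values per 3.1.8.4.3, or score-minus-threshold margins per 3.1.8.4.4) can be illustrated with the short Python sketch below. The function names, the dictionary representation, and all numbers are hypothetical choices made for the example.

```python
def idp1(lid_scores, lid_thresholds):
    """Boolean alternative: True when score >= threshold for that culture."""
    return {c: lid_scores[c] >= lid_thresholds[c] for c in lid_scores}


def idp1_margins(lid_scores, lid_thresholds):
    """Margin alternative: score minus threshold; nonnegative is read as True."""
    return {c: lid_scores[c] - lid_thresholds[c] for c in lid_scores}


# Invented scores and thresholds, for illustration only.
scores = {"Arabic": 74, "Hispanic": 8, "Other": 0}
thresholds = {"Arabic": 60, "Hispanic": 60, "Other": 60}

flags = idp1(scores, thresholds)
margins = idp1_margins(scores, thresholds)

# Further processing is needed only when no culture (or Other) passed.
needs_digraph_analysis = not any(flags.values())
```

The margin form carries strictly more information than the Boolean form, which is why the document suggests it for an object-oriented implementation in which later stages reuse the values.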

3.1.8.5. Subordinates 
None. 

3.1.9. Digraph Distribution Processor Module Decomposition

3.1.9.1. Identification 

This module is referred to as the Digraph Distribution 
Processor. 

3.1.9.2. Type 

The Digraph Distribution Processor is a module which has 
two subordinate functions. 

3.1.9.3. Purpose 

The Arabic Name Identification (ANI) subprogram
currently in use in Legacy CLASS is based purely on a 
model of digraph distribution in Arabic names. Digraph 
distribution information has proved useful in determining 
the cultural affinity of names. Based on a statistical model 
generated from digraph distribution statistics and initial and 
final trigraph statistics, the Digraph Distribution Processor 
lends additional information to the attempt to identify the 
provenance of the input name. 

3.1.9.4. Function 

3.1.9.4.1. The Digraph Distribution Processor takes as
input the surname from the name input to
ANC-E. This portion of ANC-E operates only
on surname data.

3.1.9.4.2. The Digraph Distribution Processor is invoked
only when the LID has not been successful in
assigning any cultural affinity to the input name.
(See sections 3.1.2.4.10 and 3.1.8.4.5.)

3.1.9.4.3. The Digraph Distribution Processor calculates
scores based on digraph distribution statistics
for each culture in order to determine whether
there is a sufficient amount of evidence to
identify the input name as being Hispanic or
Arabic. Note that there is no Digraph
Distribution Score computed for Other.

3.1.9.4.4. The Digraph Distribution Parameters data store
contains a Digraph Skew Factor for each
cultural affinity.

3.1.9.4.5. The Total Digraph Distribution Score for the
input name is equal to the Raw Digraph
Distribution Score returned by the DIA plus the
value of the Digraph Skew Factor for the
appropriate culture.

3.1.9.4.6. The Digraph Distribution Processor passes the
Total Digraph Distribution score for each culture
to the Intermediate Decision Processor 2 for
comparison to Digraph thresholds for cultures
under consideration.

3.1.9.4.7. There are two alternatives for the output of the
processing of the input name performed by the
Digraph Distribution Processor: an object
containing a Digraph Distribution Score for each
culture, or two Boolean values indicating
whether the name has passed the Digraph
thresholds for Arabic or Hispanic. For more
information, see 3.1.1.4.1.1.

3.1.9.4.8. If the Digraph Distribution Processor identifies a
name as Hispanic, Arabic, or both, no further
processing is required.

3.1.9.5. Subordinates

The following processes are subordinate to the Digraph 
Distribution Processor: 

• The Digraph Information Aggregator (DIA) 

• Intermediate Decision Processor 2 (Digraph Decision). 



3.1.10. Digraph Information Aggregator (DIA) Module Decomposition 

3.1.10.1. Identification 

This module is referred to as the Digraph Information 
Aggregator (DIA). 

3.1.10.2. Type 

The DIA is a process invoked by the Digraph Distribution 
Processor. The DIA operates only on surname segments consisting 
solely of alphabetic characters. 



3.1.10.3. Purpose

The DIA gathers the information necessary for the Digraph 
Distribution Processor to determine whether there is 
sufficient information to identify the input name as 
Hispanic or Arabic. 

3.1.10.4. Function 



3.1.10.4.1. For purposes of DIA processing, a surname 
segment is defined as any string of characters 
delimited by white space. 

3.1.10.4.1.1. Given a surname containing more 
than one part as input, the name is 
segmented (based on white space). 
Each part of multi-part surnames is 
processed separately, and the scores 
are combined in the manner described 
below. 



3.1.10.4.2. DIA will calculate a score for each surname
segment by totaling the scores for all digraphs
within the surname segment.

3.1.10.4.2.1. The set of digraphs for a surname
consists of all possible substrings of
two contiguous characters in the
surname.

3.1.10.4.2.2. Word boundaries are considered
characters, so the additional digraphs
"word-boundary+first-letter" and
"last-letter+word-boundary" are
included in the set of digraphs for
each name.

3.1.10.4.2.3. In general, a surname segment of
length n contains n + 1 digraphs,
ordered from leftmost to rightmost.

3.1.10.4.3. Each digraph in the surname segment is looked 
up in a table containing scores for all possible 
digraphs for all cultural affinities being scored. 
DIA maintains a cumulative total of all digraph 
scores assigned to a surname segment. 

3.1.10.4.4. Likewise, scores are assigned for the initial and 
final trigraphs of each name segment. 

3.1.10.4.5. The initial and final trigraph scores are added to 
the cumulative score for that segment. A score is 
thus calculated for each segment of the surname.

3.1.10.4.6. The Raw Digraph Distribution Score for the 
input name is equal to the sum of all individual 
surname segment scores thus calculated. 
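
The DIA scoring steps in 3.1.10.4 can be sketched in Python as follows, for a single culture. Several details are assumptions rather than statements from this document: the "#" boundary marker, the table layouts, the treatment of the initial and final trigraphs as the first and last three letters without a boundary character, the default score of 0 for entries missing from the toy tables, and the skipping of segments that are not purely alphabetic.

```python
BOUNDARY = "#"  # hypothetical marker standing for a word boundary


def digraphs(segment):
    """All two-character substrings, with a word boundary added at each end."""
    padded = BOUNDARY + segment.upper() + BOUNDARY
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def segment_score(segment, digraph_table, trigraph_table):
    """Digraph scores plus initial and final trigraph scores for one segment."""
    seg = segment.upper()
    total = sum(digraph_table.get(d, 0) for d in digraphs(seg))
    total += trigraph_table.get(("initial", seg[:3]), 0)
    total += trigraph_table.get(("final", seg[-3:]), 0)
    return total


def raw_digraph_score(surname, digraph_table, trigraph_table):
    """Sum of the scores of all white-space-delimited alphabetic segments."""
    return sum(
        segment_score(seg, digraph_table, trigraph_table)
        for seg in surname.split()
        if seg.isalpha()
    )


# Invented scores for one culture, for illustration only.
DIGRAPH_TABLE = {"AB": 2, "AD": 1}
TRIGRAPH_TABLE = {("initial", "ABA"): 3, ("final", "BAD"): 4}

pairs = digraphs("abad")
raw = raw_digraph_score("ABAD ABAD", DIGRAPH_TABLE, TRIGRAPH_TABLE)
```

A four-letter segment yields five digraphs once the two boundary digraphs are counted. In the full design the raw score would be computed once per culture, using that culture's tables.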

3.1.10.5. Subordinates 
None. 

3.1.11. Intermediate Decision Processor 2 (Digraph Decision) Module
Decomposition

3.1.11.1. Identification

This module is referred to as Intermediate Decision
Processor 2 (IDP2).

3.1.11.2. Type

IDP2 is a function which is invoked directly by the Digraph
Distribution Processor.

3.1.11.3. Purpose

IDP2 is the decision-making function of the Digraph 
Distribution Processor. It determines whether enough 
digraph distribution information is present to confidently 
determine that the input name belongs to one of the cultures 
being identified (currently Arabic or Hispanic). 

3.1.11.4. Function 

3.1.11.4.1. IDP2 accepts as input one Digraph Distribution
Score for each culture being identified.

3.1.11.4.2. For each Digraph Distribution Score, IDP2
compares that score to the Digraph threshold for
the appropriate culture.

3.1.11.4.3. If the Digraph Distribution Score is greater than
or equal to the appropriate Digraph threshold,
IDP2 returns a value of True for the culture in
question. If the Digraph Distribution Score is
less than the Digraph threshold for the culture in
question, IDP2 returns a value of False for the
culture in question.

3.1.11.4.3.1. A True value indicates to the Digraph
Distribution Processor that digraph
distribution information is conclusive
enough to confidently identify the name
as belonging to the culture in question.

3.1.11.4.3.2. A False value indicates to the Digraph
Distribution Processor that digraph
distribution information is not
conclusive enough to confidently
identify the name as belonging to the
culture in question.

3.1.11.4.3.3. A value of True can be returned for
more than one cultural affinity.

3.1.11.4.3.4. A value of False may be returned for all
cultural affinities.

3.1.11.4.4. Alternatively, IDP2 could return a value for
each culture equal to the Digraph Distribution
Score minus the Digraph threshold for that
culture.

3.1.11.4.4.1. Given the alternative above, the
Digraph Distribution Processor would
interpret negative scores as False
values and nonnegative scores as True
values.

3.1.11.4.4.2. The utility of this alternative is that if
an object-oriented implementation is
chosen, the values calculated by IDP2
could be incorporated into the object
mentioned in section 3.1.1.4.1.1.1, and
would be available as part of the
information that the name object
"knows" about itself for use in later
processing.



3.1.11.5. Subordinates 
None. 

3.1.12. Final Decision Processor Module Decomposition 

3.1.12.1. Identification 

This module is referred to as the Final Decision Processor. 

3.1.12.2. Type 

The Final Decision Processor is a module invoked directly 
by the ANC-E main program. 

3.1.12.3. Purpose 

3.1.12.3.1. Although the LID and the Digraph Distribution
Processor are each powerful methods for
identifying the cultural affinity of names in
themselves, some benefit can be gained from
combining the judgments of these two modules
when neither has been successful in reaching a
conclusion within a reasonable level of
certainty on its own.

3.1.12.3.2. Additionally, within the CLASS-E system,
information about the Country of Birth (COB)
will usually be available. Although this
information is not generally sufficient to
determine the cultural affinity of a name in
itself, it could provide the additional evidence
necessary to reach a conclusion when
combined with the judgments of the LID and
the Digraph Distribution Processor.

3.1.12.3.3. The Final Decision Processor exists to take all of
this information into account, in an effort to
determine the cultural affinity of the input
name by combining all available data when the
individual data elements themselves are not
strong enough indicators.

3.1.12.4. Function 

3.1.12.4.1. In the event that neither the LID nor the
Digraph Distribution Processor is successful in
determining a cultural affinity for the input
name, the processing continues to the Final
Decision Processor. (See sections 3.1.2.4.10,
3.1.8.4.5, and 3.1.9.4.8.)

3.1.12.4.2. If the options suggested in sections 
3.1.1.4.1.1.1, 3.1.8.4.4, and 3.1.11.4.4 are
incorporated into the implementation, the Final
Decision Processor will have access to all of
the information it needs to perform its task 
encapsulated in the name information object. 
Otherwise, the Final Decision Processor will 
take as input LID scores (for each cultural 
affinity and for Other) and digraph scores (for 
each cultural affinity) for the input name. 

3.1.12.4.3. For each culture still under consideration, the
Final Decision Processor will determine if the
Digraph Distribution score for that culture is
within the range specified by the
Under_Di_Threshold parameter. (For additional
information regarding the range specified by the
Under_Di_Threshold parameter, see section
3.2.9.4.2.5.) Note that
since there is no Digraph Distribution score
calculated for the cultural affinity "Other",
there is no Under_Di_Threshold parameter
associated with Other, and this processing
applies only to cultures included in the current
name classifier (e.g. Arabic and Hispanic).

3.1.12.4.3.1. In the event that the Digraph
Distribution score is in the range
specified for the particular culture,
processing continues to determine if
there is enough additional evidence to
identify the input name as belonging
to that culture.

3.1.12.4.3.2. In the event that the Digraph
Distribution score is not in the range
specified for the particular culture,
that cultural affinity is removed from
further consideration for the input
name.

3.1.12.4.4. For each culture still under consideration, the
Final Decision Processor will determine if the
LID score for that culture is within the range
specified by the Under_LID_Threshold
parameter. (For more information regarding the
range specified by the Under_LID_Threshold
parameter, see section 3.2.9.4.2.4.)

3.1.12.4.4.1. In the event that the LID score is in
the range specified for the particular
culture, the Final Decision Processor
will identify the input name as
belonging to that culture.

3.1.12.4.4.2. In the event that the LID score is not
in the range specified for the
particular culture, processing
continues to determine if there is
enough additional evidence to identify
the input name as belonging to that
culture.

3.1.12.4.5. For each culture still under consideration, the
Final Decision Processor determines whether
the COB supplied with the input name is in the
partition associated with the cultural affinity, as
defined in the COB Proximity (COBPROX)
Data Store.

3.1.12.4.5.1. In the event that the COB supplied
with the input name is in the partition
associated with the cultural affinity
under consideration, the Final Decision
Processor will identify the input name
as belonging to that culture.

3.1.12.4.5.2. In the event that the COB supplied
with the query is not in the partition
associated with the cultural affinity
under consideration, that cultural
affinity is removed from further
consideration for the input name.

3.1.12.4.5.3. In the event that the COB supplied
with the input name is Unknown (i.e.
"XXX" in Legacy CLASS), the Final
Decision Processor will identify the
input name as belonging to the
cultural affinity under consideration.
Note that this is a conscious decision
to err on the side of recall in the
absence of adequate information (that
is, to identify a name as belonging to
a culture, perhaps erroneously, in an
effort to avoid erroneously not
identifying some input names as
belonging to that culture). This is
related to the other policy decisions to
be made by CA, and may change
based on those decisions.

3.1.12.4.6. A summary of the processing performed
by the Final Decision Processor is contained in
Figure 3-2.

IF (Di_Threshold - Digraph_Distribution_Score - Under_Di_Threshold >= 0) AND
   ((LID_Threshold - LID_Score - Under_LID_Threshold >= 0) OR
    (COB_of_input_name is in partition OR COB_of_input_name is Unknown))
THEN
    Identify Input Name as belonging to the culture in question
END IF

Figure 3-2 
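
For readers who prefer executable notation, the condition in Figure 3-2 can be transcribed into Python as shown below. This is a literal transcription of the figure as printed, not an interpretation of it: the exact meaning of the Under_Di_Threshold and Under_LID_Threshold ranges is defined in section 3.2.9, which is not reproduced here. The function name, the argument order, the set representation of a COB partition, and the sample values are assumptions.

```python
UNKNOWN_COB = "XXX"  # code for an unknown Country of Birth in Legacy CLASS


def final_decision(di_threshold, di_score, under_di_threshold,
                   lid_threshold, lid_score, under_lid_threshold,
                   cob, cob_partition):
    """Evaluate the Figure 3-2 condition for one culture."""
    digraph_ok = di_threshold - di_score - under_di_threshold >= 0
    lid_ok = lid_threshold - lid_score - under_lid_threshold >= 0
    cob_ok = cob in cob_partition or cob == UNKNOWN_COB
    return digraph_ok and (lid_ok or cob_ok)


# Invented numbers and country codes, for illustration only.
identified = final_decision(
    di_threshold=50, di_score=45, under_di_threshold=5,
    lid_threshold=60, lid_score=55, under_lid_threshold=10,
    cob="XXX", cob_partition={"AAA", "BBB"},
)
```

In this example the LID term fails but the unknown COB satisfies the second half of the condition, reflecting the stated choice to err on the side of recall.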

3.1.12.5. Subordinates 
None.

3.2. Data Decomposition

The data tables which underlie the Linguistically Informed Decision 
processor are crucial to the success of the algorithm. As discussed in 
3.1.2.4.3, the linguistic data to be used are: High Frequency names, TAQ 
elements, Morphological elements and Ngrams. The entries for each of 
these linguistic sources will be associated, minimally, with a name field, a 
cultural group and a score. Also associated with the LID are control 
parameters. The data entities accessed by the LID, as well as by other
ANC-E modules, are depicted in Figure 3-3. This section describes in
detail the data stores used by ANC-E. For examples of the type of 
information to be included in the data stores, see the detailed example in 
Appendix A. 



[Figure 3-3: diagram of the data stores accessed by ANC-E. Recoverable labels: LID Parameter, HF Names, Ngram, TAQ data store, Digraph Distribution, Threshold Parameter; outputs Hispanic: {T, F}, Arabic: {T, F}.]

Figure 3-3

3.2.1. LID Parameter Data Store Data Decomposition

3.2.1.1. Identification 

This data store is referred to as the LID Parameter Data 
Store. 

3.2.1.2. Type

The LID Parameter Data Store is a data store that is 
accessed by the LID module. 

3.2.1.3. Purpose 

3.2.1.3.1. The LID takes factors such as HF names,
Ngrams, TAQ particles, and Morphemes into
account in determining the cultural affinity of
the input name.

3.2.1.3.2. Although each of these factors is valuable, they
should not all be given the same relative weight
in determining the cultural affinity score of the
input name.

3.2.1.3.3. Furthermore, as in all real-world applications,
the data in the CLASS-E database is not
"clean". That is, data elements are not always
found in the expected positions. Therefore, it
is common to find surname elements in the
given name field, and vice versa. Since it is
not always possible to determine whether a
particular instance of "out-of-field" data is due
to random factors influencing data entry
procedures or to a name's being from a culture
other than the one hypothesized, data found
"out of position" should not be given as great a
weight as data found in the canonical position.

3.2.1.3.4. The LID Parameter data store exists in order to
allow for different weighting of evidence found
by the LID based on the above factors without
hard-coding the exact weighting scheme itself
in the LID. This will allow for runtime
fine-tuning and adjustments to ANC-E without the
necessity of recompiling LID module code.

3.2.1.4. Function

3.2.1.4.1. Table 3.2-1 contains a description of the data to
be contained in the LID Parameter data store.

DATA NAME            DATA TYPE   DATA WIDTH   POSSIBLE VALUES
AGENT_NAME           character   10           {HFNAME, TAQ, MORPHOLOGY, NGRAM}
NAMEFIELD            character   1            {G, S}
IN_FIELD_SCORE       integer     2            {1, 2, ..., 10}
OUT_OF_FIELD_SCORE   integer     2            {1, 2, ..., 10}

Table 3.2-1



3.2.1.4.2. The LID uses the information provided in this
data store when calculating aggregate cultural
affinity scores from the detailed scoring
information returned by LIA.

3.2.1.4.2.1. AGENT_NAME indicates to which
agent (function) the given
IN_FIELD_SCORE and
OUT_OF_FIELD_SCORE weightings
apply.

3.2.1.4.2.2. NAMEFIELD indicates whether the
IN_FIELD_SCORE and
OUT_OF_FIELD_SCORE weightings
apply to the Given Name (G) or to the
Surname (S).

3.2.1.4.2.3. IN_FIELD_SCORE is the weighting to
be applied to data elements' raw scores
returned by the specified agent when
found in the specified name field.

3.2.1.4.2.4. OUT_OF_FIELD_SCORE is the
weighting to be applied to data
elements' raw scores returned by the
specified agent when found out of the
specified name field.

3.2.1.4.2.4.1. For more information
concerning IN_FIELD and
OUT_OF_FIELD attributes
returned from individual
agents via LIA, see 3.1.4.4.3.

3.2.1.4.2.4.2. For an example of scoring of
an input name using raw
scores returned by agents and
the LID Parameters, see
Figure 3-4.
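
The weighting described above can be sketched in a few lines of
Python. This is a minimal illustration, not the ANC-E implementation:
the Hit class, the aggregate function, and the assignment of each
weight pair to a particular agent and name field are assumptions made
for the example. The numeric values are the illustrative ones from
Figure 3-4 and make no statement about actual parameter settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One piece of evidence returned by an agent (hypothetical structure)."""
    agent: str       # AGENT_NAME, e.g. "HFNAME", "TAQ", "MORPHOLOGY"
    namefield: str   # NAMEFIELD the data store entry is listed for: "G" or "S"
    raw_score: int   # raw score taken from the agent's data store
    in_field: bool   # True if found in the listed field, False if out of field


# (AGENT_NAME, NAMEFIELD) -> (IN_FIELD_SCORE, OUT_OF_FIELD_SCORE).
# Assumed mapping of the illustrative Figure 3-4 weights to agents.
LID_PARAMETERS = {
    ("HFNAME", "S"): (10, 8),
    ("HFNAME", "G"): (8, 6),
    ("TAQ", "S"): (5, 3),
    ("TAQ", "G"): (4, 2),
    ("MORPHOLOGY", "S"): (3, 2),
    ("MORPHOLOGY", "G"): (2, 1),
}


def aggregate(hits, params=LID_PARAMETERS):
    """Sum raw scores, each multiplied by its in-field or out-of-field weight."""
    total = 0
    for hit in hits:
        in_weight, out_weight = params[(hit.agent, hit.namefield)]
        weight = in_weight if hit.in_field else out_weight
        total += weight * hit.raw_score
    return total


# Hispanic evidence for DELGADILLO DE GARCIA, JOSE ANTONIO as in Figure 3-4.
hispanic_hits = [
    Hit("MORPHOLOGY", "S", 1, True),  # -illo
    Hit("TAQ", "S", 1, True),         # de
    Hit("HFNAME", "S", 3, True),      # Garcia
    Hit("HFNAME", "G", 3, True),      # Jose
]
hispanic_score = aggregate(hispanic_hits)
```

Because the weights live in a table rather than in the function body,
changing a weighting means editing data only, which is the point made
in 3.2.1.3.4.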

3.2.1.5. Subordinates 
None. 

3.2.2. High Frequency Name Data Store Data Decomposition

3.2.2.1. Identification 

This data store is referred to as the HF Name Data Store. 

3.2.2.2. Type 

The HF Name Data Store is a data store that is accessed by 
the HF Name Processor. 

3.2.2.3. Purpose 

The HF Name Data Store encodes the knowledge necessary 
for the HF Name Processor function of the LID to add 
information needed for the cultural identification of the 
input name. 

3.2.2.4. Function 



3.2.2.4.1. Table 3.2-2 contains a description of the data to
be contained in the HF Name data store.

DATA NAME   DATA TYPE   DATA WIDTH   POSSIBLE VALUES
NAME        character   24           *
NAMEFIELD   character   1            {G, S}
SCORE       integer     1            {1, 2, 3, 4, 5}
CULTURE     character   1            {H, A, O}

Table 3.2-2

3.2.2.4.2. The HF Name Processor uses the information
provided in this data store when gathering
detailed HF name cultural affinity information
to be returned to LIA. High frequency given
names and surnames for each of the three target
cultural groups will be listed in the high
frequency data store.

3.2.2.4.2.1. NAME indicates the literal string 
representation of the HF name. 

3.2.2.4.2.2. NAMEFIELD indicates whether the 
score listed for the HF name applies to 
the Given Name (G) or to the Surname (S).

3.2.2.4.2.3. SCORE reflects the degree to which a 
name may be considered high 
frequency within the culture in 
question, and is the score assigned by 
the HF Name Processor when the HF 
name listed is found in the input name. 
For processing details, see section 
3.1.4, High Frequency (HF) Name 
Processor Module Decomposition. 

3.2.2.4.2.4. CULTURE indicates the cultural
affinity with which the given
NAME-NAMEFIELD-SCORE combination is
associated.



3.2.2.4.2.5. An HF name string may appear in the
HF Names Data Store multiple times if
it is associated with multiple cultural
affinities, or if it is associated with a
different frequency score in the given
name and in the surname. In this instance,
the correct score must be assigned for
each CULTURE-NAMEFIELD
combination associated with the HF
name in question.

3.2.2.4.2.5.1. For an example of scoring of
an input name using raw
scores returned by agents and
the LID Parameters, see
Figure 3-4.

3.2.2.5. Subordinates 
None. 

3.2.3. TAQ Data Store Data Decomposition 

3.2.3.1. Identification 

This data store is referred to as the TAQ Data Store. 

3.2.3.2. Type 

The TAQ Data Store is a data store that is accessed by the 
TAQ Processor. 

3.2.3.3. Purpose

The TAQ Data Store encodes the knowledge necessary for 
the TAQ Processor function of the LID to add information 
needed for the cultural identification of the input name. 

3.2.3.4. Function 

3.2.3.4.1. Table 3.2-3 contains a description of the data to
be contained in the TAQ data store.

DATA NAME   DATA TYPE   DATA WIDTH   POSSIBLE VALUES
TAQ         character   24           *
NAMEFIELD   character   1            {G, S, B}
SCORE       integer     10           {1, 2, 3, 4, 5}
CULTURE     integer     3            1..1,000

Table 3.2-3

3.2.3.4.2. The TAQ Processor uses the information 
provided in this data store when gathering 
detailed TAQ - cultural affinity information to 
be returned to LIA. TAQ values for each of the 
three target cultural groups will be listed in the 
TAQ data store. 



3.2.3.4.2.1. TAQ indicates the literal string
representation of the Title, Affix or
Qualifier particle. Note that only free
Morphemes are included in the TAQ
data store, so, by definition, all TAQs
are implicitly bounded by white space.

3.2.3.4.2.2. NAMEFIELD indicates whether the 
score listed for the given TAQ particle
applies to the Given Name (G), to the 
Surname (S), or to Both (B). 

3.2.3.4.2.2.1. In the event that the
NAMEFIELD is listed as "B", the
associated TAQ is defined as "in
position" whether it is found in the
given name or in the surname field in
the input name, and is scored
accordingly.

3.2.3.4.2.3. SCORE is a score for the given
TAQ-NAMEFIELD-CULTURE
combination. The TAQ scores will
reflect the predictive value of the TAQ
particle for the culture with which it is 
associated. This is the score assigned 
by the TAQ Processor when the TAQ 
particle listed is found in the input 
name. For processing details, see 
section 3.1.5, Title, Affix, Qualifier 
(TAQ) Processor Module 
Decomposition. 

3.2.3.4.2.4. CULTURE indicates the cultural
affinity with which the given TAQ- 
NAMEFIELD-SCORE combination is 
associated. 

3.2.3.4.2.5. A TAQ particle may appear in the 
TAQ Data Store multiple times if it is 
associated with multiple cultural
affinities. In this instance, the correct 
score must be assigned for each 
cultural affinity associated with the 
TAQ value in question. 

3.2.3.4.2.5.1. For an example of scoring of
an input name using raw
scores returned by agents and
the LID Parameters, see
Figure 3-4.

3.2.3.5. Subordinates 
None. 

3.2.4. Morphological Data Store Data Decomposition 

3.2.4.1. Identification 

This data store is referred to as the Morphological Data
Store.

3.2.4.2. Type

The Morphological Data Store is a data store that is
accessed by the Morphological Processor.

3.2.4.3. Purpose 

The Morphological Data Store encodes the knowledge 
necessary for the Morphological Processor function of the 
LID to intelligently process the input name, evaluating 
evidence based on culturally-specific Morphemes and
adding this to information needed for the cultural 
identification of the input name. 

3.2.4.4. Function 

3.2.4.4.1. Table 3.2-4 contains a description of the data to
be contained in the Morphology data store.

DATA NAME   DATA TYPE   DATA WIDTH   POSSIBLE VALUES
MORPHEME    character   24           *
NAMEFIELD   character   1            {G, S, B}
MORPHTYPE   character   1            {E, P, S, I, A}
SCORE       integer     1            {1, 2, 3, 4, 5}
CULTURE     character   1            {A, H, O}

Table 3.2-4



3.2.4.4.2. The Morphological Processor uses the
information provided in this data store when
gathering detailed Morpheme - cultural affinity
information to be returned to LIA. Morpheme
values for each of the three target cultural
groups will be listed in the Morphological Data
Store.



3.2.4.4.2.1. MORPHEME indicates the literal
string representation of the Morpheme. 
Note that only bound Morphemes are 
included in the Morphological data 
store, so, by definition, all Morphemes 
are intended to be located as substrings 
of individual segments of the input 
name. 

3.2.4.4.2.2. NAMEFIELD indicates whether the 
score listed for the given Morpheme 
applies to the Given Name (G), to the 
Surname (S), or to Both (B). 



3.2.4.4.2.2.1. In the event that the
NAMEFIELD is listed as "B", the
associated MORPHEME is defined
as "in position" whether it is found
in the given name or in the surname
field in the input name, and is scored
accordingly.

3.2.4.4.2.3. MORPHTYPE indicates the linguistic 
distribution of the MORPHEME. 

3.2.4.4.2.3.1. PREFIXES (P) are substrings
which begin in the first
character position of a name
segment.

3.2.4.4.2.3.2. INFIXES (I) are substrings
which begin in a character
position in the name segment
which is not the first, and end
in a character position in the
name segment that is not the
last. They are substrings that
are neither at the beginning
nor the end of the name
segment.

3.2.4.4.2.3.3. SUFFIXES (S) are substrings
which end in the final
character position of a name
segment.

3.2.4.4.2.3.4. A MORPHEME for which 
the MORPHTYPE is 
indicated as EDGE (E) can be 
found as either a PREFIX or 
a SUFFIX in a name segment 
in the input name. 

3.2.4.4.2.3.5. A MORPHEME for which 
the MORPHTYPE is 
indicated as ALL (A) can be 
found anywhere in a name 
segment in the input name. 



3.2.4.4.2.3.6. MORPHEMEs that are found 
in positions other than those 
indicated by the 
corresponding 
MORPHTYPE are not 
assigned any points for the 
purpose of identifying the 
cultural affinity of the input 
name. 
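
The MORPHTYPE rules above amount to a position test on each occurrence
of a morpheme inside a name segment. The following Python sketch
illustrates that test under stated assumptions: the function names
occurrences and matches_type are hypothetical, and matching is done on
plain upper-case strings without any of the other processing described
in section 3.1.6.

```python
def occurrences(segment, morpheme):
    """Return every start index at which morpheme occurs in segment."""
    found = []
    start = segment.find(morpheme)
    while start != -1:
        found.append(start)
        start = segment.find(morpheme, start + 1)
    return found


def matches_type(segment, morpheme, morphtype):
    """True if some occurrence of morpheme sits where morphtype allows.

    P = prefix, I = infix, S = suffix, E = edge (prefix or suffix),
    A = all (anywhere in the segment).
    """
    last_start = len(segment) - len(morpheme)
    for start in occurrences(segment, morpheme):
        is_prefix = start == 0
        is_suffix = start == last_start
        allowed = {
            "P": is_prefix,
            "S": is_suffix,
            "I": not is_prefix and not is_suffix,
            "E": is_prefix or is_suffix,
            "A": True,
        }
        if allowed[morphtype]:
            return True
    # Occurrences in other positions earn no points (3.2.4.4.2.3.6).
    return False
```

For the segment DELGADILLO, the string ILLO qualifies as a suffix or an
edge morpheme but not as a prefix or an infix, so an entry listed with
MORPHTYPE P would contribute nothing for this name.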

3.2.4.4.2.4. SCORE is a score for the given
MORPHEME-NAMEFIELD-MORPHTYPE-CULTURE
combination. The MORPHEME
scores will reflect the predictive value
of the Morpheme for the culture with
which it is associated. This is the
score assigned by the Morphological
Processor when the Morpheme listed is
found in the input name. For
processing details, see section 3.1.6,
Morphological Processor Module
Decomposition.

3.2.4.4.2.5. CULTURE indicates the cultural
affinity with which the given
MORPHEME-MORPHTYPE-NAMEFIELD-SCORE
combination is associated.

3.2.4.4.2.6. A Morpheme may appear in the
Morphological Data Store multiple
times if it is associated with multiple
cultural affinities, or if it can be
associated with multiple values of
NAMEFIELD and/or MORPHTYPE
for a given cultural affinity. In this
instance, the correct score must be
assigned for each
MORPHEME-MORPHTYPE-NAMEFIELD-CULTURE
combination associated
with the Morpheme in question.

3.2.4.4.2.6.1. For an example of scoring of
an input name using raw 
scores returned by agents and 
the LID Parameters, see 
Figure 3-4. 

3.2.4.5. Subordinates 
None. 

3.2.5. Ngram Data Store Data Decomposition 

3.2.5.1. Identification 

This data store is referred to as the Ngram Data Store. 

3.2.5.2. Type 

The Ngram Data Store is a data store that is accessed by the 
Ngram Processor. 

3.2.5.3. Purpose 

The Ngram Data Store encodes the knowledge necessary 
for the Ngram Processor function of the LID to add 
evidence based on the distribution of culturally salient 
Ngrams to information needed for the cultural identification 
of the input name. 

3.2.5.4. Function 

3.2.5.4.1. Table 3.2-5 contains a description of the data to
be contained in the Ngram data store.

DATA NAME   DATA TYPE   DATA WIDTH   POSSIBLE VALUES
NGRAM       character   10           *
NAMEFIELD   character   1            {G, S, B}
NGRAMTYPE   character   1            {E, P, I, S, A}
SCORE       integer     1            {1, 2, 3, 4, 5}
CULTURE     character   1            {A, H, O}

Table 3.2-5



3.2.5.4.2. The Ngram Processor uses the information 
provided in this data store when gathering 
detailed cultural affinity information to be 
returned to LIA. Ngram values for each of the 
three target cultural groups will be listed in the 
Ngram Data Store. 

3.2.5.4.2.1. NGRAM indicates the literal string
representation of the Ngram. Note that 
all Ngrams are intended to be located 
as substrings of individual segments of 
the input name. 

3.2.5.4.2.2. NAMEFIELD indicates whether the 
score listed for the given Ngram 
applies to the Given Name (G), to the 
Surname (S), or to Both (B).

3.2.5.4.2.2.1. In the event that the
NAMEFIELD is listed as "B", the
associated NGRAM is defined as "in
position" whether it is found in the
given name or in the surname field in
the input name, and is scored
accordingly.

3.2.5.4.2.3. NGRAMTYPE indicates the 
linguistic distribution of the NGRAM. 

3.2.5.4.2.3.1. PREFIXES (P) are substrings
which begin in the first 
character position of a name 
segment. 



ANC-E    Language Analysis Systems, Inc.    44    03/19/98



3.2.5.4.2.3.2. INFIXES (I) are substrings
which begin in a character
position in the name segment
which is not the first, and end
in a character position in the
name segment that is not the
last. They are substrings that
are neither at the beginning
nor the end of the name
segment.

3.2.5.4.2.3.3. SUFFIXES (S) are substrings
which end in the final
character position of a name
segment.

3.2.5.4.2.3.4. An NGRAM for which the
NGRAMTYPE is indicated
as EDGE (E) can be found as
either a PREFIX or a
SUFFIX in a name segment
in the input name.

3.2.5.4.2.3.5. An NGRAM for which the
NGRAMTYPE is indicated
as ALL (A) can be found
anywhere in a name segment
in the input name.

3.2.5.4.2.3.6. NGRAMs that are found in
positions other than those
indicated by the
corresponding
NGRAMTYPE are not
assigned any points for the
purpose of identifying the
cultural affinity of the input
name.

3.2.5.4.2.4. SCORE is a score for the given
NGRAM-NAMEFIELD-NGRAMTYPE-CULTURE
combination. The NGRAM scores
will reflect the predictive value of the
Ngram for the culture with which it is
associated. This is the score assigned
by the Ngram Processor when the
given Ngram is found in the input
name. For processing details, see
section 3.1.7, Ngram Processor
Module Decomposition.

3.2.5.4.2.5. CULTURE indicates the cultural 
affinity with which the given 
NGRAM-NGRAMTYPE- 
NAMEFIELD-SCORE combination is 
associated. 

3.2.5.4.2.6. An Ngram may appear in the Ngram 
Data Store multiple times if it is 
associated with multiple cultural 
affinities, or if it can be associated with 
multiple values of NAMEFIELD 
and/or NGRAMTYPE for a given 
cultural affinity. In this instance, the 
correct score must be assigned for each 
NGRAM-NGRAMTYPE- 
NAMEFIELD-CULTURE 
combination associated with the 
Ngram in question. 

3.2.5.4.2.6.1. For an example of scoring of 
an input name using raw 
scores returned by agents and 
the LID Parameters, see 
Figure 3-4. 



3.2.5.5. Subordinates
None.

[Figure 3-4: worked scoring example. Only the legible content of the
diagram is reproduced here.]

Sample name: DELGADILLO DE GARCIA, JOSE ANTONIO

Factors, in the order shown:
    In Field: 10, Out Field: 8
    In Field: 8, Out Field: 6
    In Field SN: 5, Out Field SN: 3, In Field GN: 4, Out Field GN: 2
    In Field SN: 3, Out Field SN: 2, In Field GN: 2, Out Field GN: 1
    In Field SN: 5, Out Field SN: 3, In Field GN: 4, Out Field GN: 2

Sample data, High Frequency SN (culture H, field S): Garcia 3, Salazar,
Sambrano 1, Greco 1. The remaining sample rows are not legible.

N.B. The data shown here are for the purpose of illustration only and do
not necessarily reflect actual table values.

Hispanic: -illo (3*1) + de (5*1) + Garcia (10*3) + Jose (8*3) = 62
    (no score is legible for Antonio)
Arabic: 0 (note that none of these elements are marked as Arabic in the
    sample data above)
Other: (3 * 3) + (8 * 2)

N.B.: The data shown here are for the purposes of illustration only and
are not intended to make any statement about actual table values or
parameter settings.

Figure 3-4



3.2.6. Digraph Distribution Data Store Data Decomposition 

3.2.6.1. Identification 

This data store is referred to as the Digraph Data Store. 

3.2.6.2. Type 

The Digraph Data Store is a data store that is accessed by 
the Digraph Distribution Processor. 

3.2.6.3. Purpose 

The Digraph Data Store encodes the knowledge necessary 
regarding the statistical distribution of digraphs within a 
given culture. It is this information that drives the Digraph 
Distribution Processor. 

3.2.6.4. Function 

3.2.6.4.1. Table 3.2-6 contains a description of the data to
be contained in the Digraph data store.

DATA NAME   DATA TYPE   DATA WIDTH   POSSIBLE VALUES
DI          character   2            *
SCORE       long        3.4          {-50.0000 - +50.0000}
CULTURE     character   1            {A, H}

Table 3.2-6

3.2.6.4.2. The Digraph Processor uses the information
provided in this data store when determining the
contribution that the distribution of digraphs in
the input name will have in determining the
cultural affinity of that name. Digraph
Distribution statistics will be listed in the
Digraph Data Store for each of the specific
cultures being identified. That is, in the current
implementation, Digraph Distribution statistics
will be listed for Arabic and Hispanic, but not
for "Other".

3.2.6.4.2.1. DI indicates the literal string
representation of the digraph.
Note that digraphs may
include all alphabetical
characters as well as the
word-boundary character "#".

3.2.6.4.2.2. SCORE reflects the predictive value 
of the digraph for the culture with 
which it is associated. This is the 
score used by the Digraph Distribution 
Processor when the given digraph is 
found in the input name. For 
processing details, see section 3.1.9, 
Digraph Distribution Processor 
Module Decomposition. 

3.2.6.4.2.3. CULTURE indicates the cultural 
affinity with which the given DI- 
SCORE combination is associated. 
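
The DI and SCORE fields can be illustrated with a short Python sketch.
It is not the Digraph Distribution Processor described in section
3.1.9: the function names, the dictionary layout, and in particular the
choice to sum the per-digraph scores are assumptions made for the
example, and the score values are invented for illustration.

```python
def digraphs(segment):
    """Split a name segment into digraphs, using '#' as the word boundary."""
    padded = f"#{segment}#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def raw_digraph_score(segment, table, culture):
    """Add up the stored scores of the segment's digraphs for one culture.

    table maps (DI, CULTURE) to SCORE; digraphs with no entry count as 0.0.
    Summation is an assumption, not the documented combination rule.
    """
    return sum(table.get((di, culture), 0.0) for di in digraphs(segment))


# Invented sample rows in the shape of Table 3.2-6: (DI, CULTURE) -> SCORE.
sample_table = {
    ("#J", "H"): 1.5,
    ("SE", "H"): 0.25,
    ("JO", "A"): -2.0,
}
jose_digraphs = digraphs("JOSE")
jose_hispanic = raw_digraph_score("JOSE", sample_table, "H")
```

Padding the segment with "#" on both sides is what lets the same table
hold boundary digraphs such as "#J" and "E#" alongside interior ones.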

3.2.6.5. Subordinates 
None. 






3.2.7. Trigraph Distribution Data Store Data Decomposition 

3.2.7.1. Identification 

This data store is referred to as the Trigraph Data Store. 

3.2.7.2. Type 

The Trigraph Data Store is a data store that is accessed by
the Digraph Distribution Processor. The Digraph
Distribution Processor takes initial and final trigraphs into
account in producing a digraph distribution score for the
input name.

3.2.7.3. Purpose 

The Trigraph Data Store encodes the knowledge necessary
regarding the statistical distribution of trigraphs within a
given culture. This information is taken into account in the
Digraph Distribution Processor, since name boundaries
tend to be highly indicative of the cultural affinity of the
name.

3.2.7.4. Function 



3.2.7.4.1. Table 3.2-7 contains a description of the data to
be contained in the Trigraph data store.

DATA NAME   DATA TYPE   DATA WIDTH   POSSIBLE VALUES
TRI         character   3            *
SCORE       long        3.4          {-50.0000 - +50.0000}
CULTURE     character   1            {A, H}

Table 3.2-7

3.2.7.4.2. The Digraph Processor uses the information
provided in this data store when determining the
contribution that the distribution of initial and
final trigraphs in the input name will have in
determining the cultural affinity of that name.
Trigraph Distribution statistics will be listed in
the Trigraph Data Store for each of the specific
cultures being identified. That is, in the current
implementation, Trigraph Distribution statistics
will be listed for Arabic and Hispanic, but not
for "Other".

3.2.7.4.2.1. TRI indicates the literal string
representation of the trigraph. Note
that trigraphs may include all
alphabetical characters as well as the
word-boundary character "#".



3.2.7.4.2.2. SCORE reflects the predictive value 
of the trigraph for the culture with 
which it is associated. This is the 
score used by the Digraph Distribution 
Processor when the given trigraph is 
found in the input name. For 
processing details, see section 3.1.9,
Digraph Distribution Processor 
Module Decomposition. 

3.2.7.4.2.3. CULTURE indicates the cultural
affinity with which the given
TRI-SCORE combination is associated.



3.2.7.4.3. Trigraph Distribution statistics for only initial 
and final trigraphs will be included in the 
Trigraph Data Store. 

3.2.7.5. Subordinates 
None. 

3.2.8. Digraph Distribution Processor Parameter Data Store Data 
Decomposition 

3.2.8.1. Identification 

This data store is referred to as the Digraph Processor 
Parameter Data Store.



3.2.8.2. Type 

The Digraph Processor Parameter Data Store is a data store 
that is accessed by the Digraph Distribution Processor. 

3.2.8.3. Purpose 

The Digraph Processor Parameter Data Store contains
adjustments that must be made to the digraph distribution
scores computed by the Digraph Distribution Processor due
to the fact that some cultures are over-represented in the
digraph model.

3.2.8.4. Function

3.2.8.4.1. Table 3.2-8 contains a description of the data to
be contained in the Digraph Processor Parameter data store.

DATA NAME   DATA TYPE   DATA WIDTH   POSSIBLE VALUES
SKEW        integer     3            {-999 - +999}
CULTURE     character   1            {A, H}

Table 3.2-8

3.2.8.4.2. The Digraph Processor uses the information 

provided in this data store when determining the 
final digraph distribution score to assign to the 
input name. A SKEW will be specified in the 
Digraph Processor Parameter Data Store for 
each of the specific cultures being identified. 
That is, in the current implementation, a SKEW 
will be listed for Arabic and Hispanic, but not
for "Other".

3.2.8.4.2.1. SKEW indicates the value to be added
to or subtracted from the raw digraph 
distribution score by the digraph 
distribution processor to level data 
distribution differences. 

3.2.8.4.2.2. CULTURE indicates the cultural 
affinity with which the given SKEW is 
associated. 

3.2.8.5. Subordinates 
None. 

3.2.9. Threshold Parameter Data Store Data Decomposition 

3.2.9.1. Identification 

This data store is referred to as the Threshold Parameter 
Data Store. 

3.2.9.2. Type 

The Threshold Parameter Data Store is a data store that is
accessed by the Intermediate Decision Processor 1 (IDP1),
the Intermediate Decision Processor 2 (IDP2), and the Final
Decision Processor.

3.2.9.3. Purpose 

The Threshold Parameter Data Store contains information 
regarding thresholds that must be met in order for the input 
name to be identified as belonging to a particular target 
culture. 

3.2.9.4. Function 

3.2.9.4.1. Table 3.2-9 contains a description of the data to
be contained in the Threshold Parameter data
store.

DATA NAME             DATA TYPE   DATA WIDTH   POSSIBLE VALUES
CULTURE               character   1            {A, H, O}
LID_THRESHOLD         integer     3            {0-999}
DI_THRESHOLD          float       3.4          {-999.9999 - +999.9999}
UNDER_LID_THRESHOLD   integer     3            {0-999}
UNDER_DI_THRESHOLD    integer     3            {0-999}

Table 3.2-9



3.2.9.4.2. The three "decision processor" modules (IDP1,
IDP2, and the Final Decision Processor) use the
information provided in this data store when
determining whether enough information has
been accumulated to identify the input name as
belonging to a particular culture.
LID_THRESHOLD and
UNDER_LID_THRESHOLD data values will
be specified in the Threshold Parameter Data
Store for each of the cultures being identified,
including "Other". DI_THRESHOLD and
UNDER_DI_THRESHOLD values will be
specified for specific cultures only (i.e., Hispanic
and Arabic).

3.2.9.4.2.1. CULTURE indicates the cultural
affinity with which the given threshold
is associated.

3.2.9.4.2.2. LID_THRESHOLD is used by IDP1 in
determining whether enough
information has been accumulated to
identify the input name as belonging to
a particular culture. For processing
information, see section 3.1.8,
Intermediate Decision Processor 1
(LID Decision) Module
Decomposition.

3.2.9.4.2.3. DI_THRESHOLD is used in IDP2 in
determining whether enough
information has been accumulated to
identify the input name as belonging to
a particular culture. For processing
information, see section 3.1.11,
Intermediate Decision Processor 2
(Digraph Decision) Module
Decomposition.

3.2.9.4.2.4. UNDER_LID_THRESHOLD is used
by the Final Decision Processor, and
indicates the amount by which a name
can fall short of the
LID_THRESHOLD and still be
considered for membership in a
particular culture, provided that other
criteria are met. As such,
UNDER_LID_THRESHOLD defines
a range of values (between the
UNDER_LID_THRESHOLD and the
LID_THRESHOLD) that, when
considered in conjunction with other
evidence, can result in the input
name's being identified as belonging
to the culture in question. For
processing information, see section
3.1.12, Final Decision Processor
Module Decomposition and Figure 3-2.

3.2.9.4.2.5. UNDER_DI_THRESHOLD is used by
the Final Decision Processor, and
indicates the amount by which a name
can fall short of the DI_THRESHOLD
and still be considered for membership
in a particular culture, provided that
other criteria are met. As such,
UNDER_DI_THRESHOLD defines a
range of values (between the
UNDER_DI_THRESHOLD and the
DI_THRESHOLD) that, when
considered in conjunction with other
evidence, can result in the input
name's being identified as belonging
to the culture in question. For
processing information, see section
3.1.12, Final Decision Processor
Module Decomposition and Figure 3-2.
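
The relationship between a threshold and its "under" value can be
sketched as a three-way classification. The sketch below is only one
reading of the text above: it assumes that UNDER_LID_THRESHOLD is the
permitted shortfall below LID_THRESHOLD and that a score equal to the
threshold meets it. The Threshold class, the lid_band function, and the
numeric values are hypothetical; the actual decision logic belongs to
sections 3.1.8 and 3.1.12.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """One Threshold Parameter row, restricted to the LID columns."""
    culture: str
    lid_threshold: int
    under_lid_threshold: int  # assumed: allowed shortfall below the threshold


def lid_band(score, threshold):
    """Classify an aggregate LID score against one culture's thresholds.

    "meets" -> the score reaches LID_THRESHOLD
    "near"  -> the score falls short by no more than UNDER_LID_THRESHOLD,
               so other evidence may still decide the culture
    "below" -> the score falls short by more than the allowed amount
    """
    if score >= threshold.lid_threshold:
        return "meets"
    if score >= threshold.lid_threshold - threshold.under_lid_threshold:
        return "near"
    return "below"


arabic = Threshold(culture="A", lid_threshold=60, under_lid_threshold=15)
```

The same shape would apply to DI_THRESHOLD and UNDER_DI_THRESHOLD, with
floating-point scores in place of integers.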

3.2.9.5. Subordinates 
None. 

3.2.10. COB Proximity (COBPROX) Data Store Data Decomposition

3.2.10.1. Identification 

This data store is referred to as the COBPROX Data Store. 

3.2.10.2. Type 

The COBPROX Data Store is a data store that is accessed 
by the Final Decision Processor. 

3.2.10.3. Purpose 

The COBPROX Data Store contains information enabling 
the Final Decision Processor to determine which COBs are
to be considered as related when determining the cultural 
affinity of the input name. For processing information, see 
section 3.1.12.4.5 and Figure 3-2. 

3.2.10.4. Function 

3.2.10.4.1.1. ANC-E will use the CLASS-E
COBPROX Data Store ("partitions
table") to fill this function.

3.2.10.5. Subordinates
None.

[Figure: Overview of LID Processing. A CLASS-E transaction passes
through the Linguistic Information Aggregator (LIA), which produces
detailed scoring information, and then through the Linguistically
Informed Decision Processor (LID), which applies the LID parameters to
produce Arabic, Hispanic, and "Other" scores and a {T,F} decision for
each; processing can continue to the Digraph Distribution Processor.
Example shown: for the transaction "Nuredin Bin Jaffari Mahmoud Taufiq
[AFGH]", the LID produces Arabic Score: 71, Hispanic Score: 2, "Other"
Score: 10, giving Arabic: T, Hispanic: F, Other: F. The name is
identified as Arabic. This result is returned for query/add routing
purposes.]

[Figure: Detailed View of LID Processing (p. 60). The name is
identified as Arabic; this result is returned for query/add routing.]

SOFTWARE DESIGN DESCRIPTION
FOR THE ARABIC NAME SEARCH
ALGORITHM
FOR CLASS-E
(ANA-E)



1. INTRODUCTION 
1.1. Purpose 

The variation that can occur in the transcription of Arabic names into roman representation 
poses formidable problems for retrieval systems with very large databases that depend
solely on standard string-comparison techniques. For example, the following names are 
transcription variants of the same name: SALEHUDDINE, IMHEMED and SAALAH 
EL DEEN, MUHAMMED. The significant differences in their spellings and in the 
distribution of white space would virtually preclude any possibility of identifying these 
names as similar enough to be candidates for retrieval if the usual techniques were applied. 
The task, then, is to capture the relatedness of these names and to incorporate the insights 
into their commonality into the retrieval system.

Arabic names are made up of a Given Name (GN) (usually one, although compound 
names may occur: SAMIR; MOHAMAD ALI) and a string of familial (paternal) 
relations following the GN (ABD EL KADEER SAMIR ABD EL LATIF). The string
following the GN is generally made up of GNs which are taken from the father, 
grandfather and other relations. Only in rare cases can any of these segments be identified 
as a SN, i.e., a name used by every member of the family to signal family membership. 
The full string following the GN provides crucial information about the individual that is 
lost if it is sometimes in the GN field and sometimes in the SN field. So, positioning
names that occur after the first GN in the SN field provides the opportunity for better
matches. 



1.2. Scope 

In 1996, LAS proposed an initial solution to the problem, the salient feature of which was
to level spelling differences and thereby generate one representation for the myriad
spellings of a single name. This process is known as regularization, a technique that was
implemented in the Legacy CLASS system as Legacy ANA. Legacy ANA is a
preprocessing module that feeds into the Legacy CLASS search system. The general 
characteristics of Legacy ANA are: 

1) Both query and add procedures are identical for the Legacy ANA system.

2) A rudimentary Arabic name identifier (ANI) determines if an input name
qualifies for handling by the Legacy ANA algorithm. All names that qualify
for Legacy ANA handling are also sent to the generic processing module
provided by Legacy CLASS and to the DOB processor, when appropriate.

3) A set of regularization rules is applied to the Arabic input name, leveling the 
spelling differences of the name segments to the most common representation 
of the input name. IMHEMED and MUHAMMED will both be regularized 
to the form MUHAMAD, for example. Some title/affix/qualifier information
may also be removed from the name to focus on the name stem. The
regularization rules are rewrite rules that use notation developed solely for this
processor. The rule engine necessary for implementation of the regularization
rules was also developed specifically to handle the Legacy ANA regularization
rules. The output of the regularization component is a regularized form of the
input name.

4) The output of the regularization component (the regularized form of the name) 
serves as the input to the generic CLASS search system. CLASS produces 
standard compressed-name keys, but on the regularized form that is the output 
of Legacy ANA. CLASS accesses the database records through the keys on the 
regularized form. (The keys are generated for both queries and adds and are 
stored with the record when a record is added to the database.) 

5) A digraph match is then performed on the regularized record and query forms 
to determine name similarity. The match criteria are those of CLASS. 

The ANA-E system is an enhancement of the Arabic name search system that was 
developed by LAS for the Legacy CLASS system. The principle of name regularization 
remains the same in ANA-E, although the design and approach of ANA-E are different in 
a number of important ways. 

1) An independent Name Classifier (ANC-E) has been developed. (The ANC-E 
design description is provided as Attachment A in LAS Linguistic Memo 
CT970044, May 30, 1997.) ANC-E will direct input names to the Arabic 
and/or Hispanic processors. Its functionality is far more sophisticated than that of the 
Arabic name typer (ANI). All records will also be directed to the CN pipe of 
CLASS-E and to the DOB pipe, if appropriate. 

2) A significant amount of preprocessing of the name takes place in ANA-E that 
recognizes the unique character of Arabic names, focusing on the leftmost GN 
as the most stable element in the name and rejoining all other GN segments 
with their SN partners. 

3) The regularization rules and rule engine have changed. The rules are 
represented in standard regular expressions and the format has changed. The 
rule engine uses different match techniques, is much simpler in its 
implementation and therefore can be easily applied to other rule sets. 

4) The output of the regularization rules is a computationally viable form, one that 
may not be the most common representation of a name (as was a requirement 
of Legacy ANA). 

5) The new ANA-E rule language has already allowed a dramatic increase in the 
[...] fewer distinct name segments, reducing processing [...] 

6) [...] rules at the same time permit some unpredictable variation. The 
regularization rules account for much of the predictable variation in names but 
[...] able to accommodate unpredictable [...] 

7) Retrieval is based on keys that represent a class of pre-determined variants 
of a name segment and are formed from the GN1 and SN segments. 

8) Gender has been added as a search criterion to reduce the occurrence of 
[...] 

9) [...] used in ANA-E [...] 
and sensitivity to Arabic-specific name characteristics. 

The CLASS-E system will support several [...] Multi- 
Pipe Architecture (MPA) already in place supports the generic-CLASS search [...] 
preprocessing module that feeds into the generic-CLASS processing pipe, [...] 
not characterized as a separate search pipe.) 

In CLASS-E the Multi-Pipe Architecture will be extended [...] 
processing pipe, a distinct Hispanic processing pipe [...] 

An input name may be submitted to more [...] 
decision of CA/EX/CSD to determine to which and how many pipes to pass a given input 
name. It is suggested that names classified as Arabic by the Advanced Name Classifier for 
CLASS-E (ANC-E) be submitted to multiple processors: the generic CLASS-E 
processing pipe, the DOB processing pipe and the ANA-E processing pipe. 

1.3. Definitions and Acronyms 



ACOB              Arabic COB Category Data Store 
ADE               Arabic Data Evaluator 
AFS               Arabic Filter and Sorter 
AG                Applicant Gender (user supplied) 
AGI               Arabic Gender Identifier 
AKG               Arabic Key Generator 
ANA-E             Arabic Name Search Algorithm for CLASS-E 
ANC               Advanced Name Classifier for CLASS-E 
ANI               Arabic Name Identifier (Legacy ANA) 
ANR               Arabic Name Regularizer 
ANT               Arabic Name Type Data Store 
APP               Arabic Pre-Processor 
ARE               Arabic Rule Engine 
ARR               Arabic Regularization Rules Data Store 
ASE               Arabic Search Engine 
ASP               Arabic Segment Positioner 
ATD               Arabic TAQ Data Store 
ATP               Arabic TAQ Processor 
COBPROX           [...] 
DELETE            Segment will be removed from any further consideration in the name 
                  matching process; it will contribute marginally to the filtering 
                  process. The segment will not be removed from the record. 
DISREGARD         Segment will be removed from consideration in the name retrieval 
                  process, but will contribute to the filtering and sorting processes. 
[...]             Digraph Value 
F                 Female Gender 
FNU               First Name Unknown 
[...]             Filter Parameter Data Store 
GN                Given Name 
GN1               Leftmost GN segment 
GNDR              Gender 
[...]             Given Name Threshold (Filter) 
GN VAL            Final Given Name Value 
Given Name Field  All name segments to the right of the comma 
HF                High Frequency 
[...]             Arabic High Frequency Name Identifier 
HK                High Frequency Key 
[...]             Given Name Initial Value 
[...]             Surname Initial Value 
K-Key             Special Key formed to handle name segments with "k" 
Legacy ANA        Arabic Name Algorithm for Legacy CLASS 
LF                Low Frequency 
LTF               Linguistic Trace Facility 
M                 Male Gender 
OPSN              Out-of-Position Surname Segment 
OPVAL             Out-of-Position Value (Filter) 
PK                Primary Key 
RCL               Refusal Code Level Data Store 
Record Gender     Gender determined for a record based on two gender validators, input 
                  gender and HF name gender; all gender indicators must agree. 
Regularization    Rule-based process that levels the differences among the roman spellings 
                  of a single Arabic name 
RG                Record Gender 
RLYOB             Refusal Code Level/Year-of-Birth Range Data Store 
Segment           Any single name piece, surrounded by white space 
SI                Single-Part Key 
SK                Search Key 
SNTHR             Surname Threshold (Filter) 
SN VAL            Final Surname Value 
SP                Special Key 
SS                Standard Search Key 
Surname Field     All name segments to the left of the comma 
TF                TAQ Filter Data Store 
TAQ               Title/Affix/Qualifier 
TAQAGN            Value for Missing TAQ in the Given Name 
TAQASN            Value for Missing TAQ in the Surname 
TAQXGN            Value for TAQ DELETE in Given Name 
TAQXSN            Value for TAQ DELETE in Surname 
U                 Unknown (Ambiguous) Gender 
WK                Wild-Card Key 
YR                Year-of-Birth Range Data Store 



2. MODULE DECOMPOSITION 

2.1. The Arabic Name Search Algorithm for CLASS-E (ANA-E) will consist of 
three primary components (see pages 6-9 for graphic representations of these 
components): 



• the Arabic Pre-Processor (APP), 

• the Arabic Search Engine (ASE), and 

• the Arabic Filter and Sorter (AFS). 

2.2. ARABIC PRE-PROCESSOR MODULE DECOMPOSITION 

2.2.1. Identification 

This module is known as the Arabic Pre-Processor (APP). 

2.2.2. Type 

The APP is the first programming module in the larger ANA-E algorithm and 
consists of subordinate functions that manipulate the name segments in various 
ways to prepare the name for creation of search keys by the Arabic Search Engine. 

2.2.3. Purpose 

Because of the significant variation that can occur in names that have been 
romanized from the original Arabic script, Arabic names will benefit from 
attempts to level the spelling differences. In addition, the standard format of an 
Arabic name is Given Name followed by a string of segments that indicate 
familial relations. In many countries, none of these segments functions as 
what is standardly referred to as a surname. What is determined to be a 
Surname for purposes of a CLASS retrieval (i.e., what is placed in the Surname 
field) is therefore quite arbitrary. Arabic names will consequently benefit from 
movement of name segments that would contribute to a potential match. 

2.2.4. Function 

2.2.4.1. The Arabic Pre-Processor (APP) will accept as input any name that 
has been identified as Arabic by the Advanced Name Classifier for 
CLASS-E (ANC-E) and will prepare a name for the Arabic Search 
Engine by applying Arabic regularization rules to the name segments and 
reorganizing the name according to Arabic naming principles. 

2.2.4.2. The APP can alternatively create a name object that "knows" 
characteristics about itself and collects information as it proceeds through 
the processing functions. 

2.2.5. Subordinates 

• Arabic Name Regularizer (ANR) 

• Arabic TAQ Processor (ATP) 

• Arabic Data Evaluator (ADE) 

• Arabic Segment Positioner (ASP) 

• Arabic Gender Identifier (AGI) 

2.3. ARABIC NAME REGULARIZER MODULE DECOMPOSITION 

2.3.1. Identification 

This module will be known as the Arabic Name Regularizer (ANR) and 
will consist of one subordinate processor, the Arabic Rule Engine, which 
will access and apply the rules in one data store, the Arabic 
Regularization Rules (ARR) Data Store. 

2.3.2. Type 

The ANR is a program that 

• will operate on a full surname string and a full given name string of 
an add or query record, 

• will generate a regularized form for each name segment or string of 
name segments to which the regularization rules have applied, and 

• will submit the regularized form to other functions in the APP to 
continue to prepare the name for submission to the Arabic Search 
Engine. 

2.3.3. Purpose 

The transcription of Arabic names from their native format (Arabic 
script) to the roman alphabet is highly variable; few, if any, transcription 
standards exist. Such rampant variation poses significant problems for 
string matching and retrieval systems; there are often too many characters 
that differ to effect a retrieval in character-based retrieval systems. For 
example, MUHAMMAD and IMHEMED are roman spellings of the 
same name; that is, they are represented by the same string of characters 
in the Arabic script. Leveling the differences in roman spelling, wherever 
possible, would improve record retrieval dramatically. 

2.3.4. Function 

The ANR applies a set of regularization rules (ARR) to the surname and to the 
given name through the Arabic Rule Engine and produces a regularized form 
for any name segment or string of segments to which the rules can apply. 

2.3.5. Subordinates 

The ANR consists of one subordinate function, the Arabic Rule Engine, which 
accesses the Arabic Regularization Rule Data Store. 

2.4. ARABIC RULE ENGINE MODULE DECOMPOSITION 

2.4.1. Identification 

This function is known as the Arabic Rule Engine (ARE). 

2.4.2. Type 

The ARE is a program that attempts to apply transformation rules to an input 
string of characters and to effect a change in that string. 

2.4.3. Purpose 

2.4.3.1. The development of rules that are implemented in standard and readily 
accessible regular expressions allows for use of a less idiosyncratic rule 
engine than the one developed for Legacy ANA. (See Arabic 
Regularization Rule Data Store, Section 3.3). 

2.4.3.2. The Arabic Regularization Rules (ARR) require an implementation 
module to effect the changes specified in the rules. The ARE plays that 
role. 

2.4.3.3. The ARE replaces the rule engine developed for Legacy ANA. It is 
simpler and more generic and can be used for other rule implementations. 

2.4.3.4. The ARR are more easily altered and reviewed. 

2.4.4. Function 

2.4.4.1. The ARE accepts a full surname (SN) or full given name (GN) string 
as input. 

2.4.4.2. The ARE will add a white space to the beginning of the SN or GN 
string that it accepts to serve as boundary markers. 

2.4.4.3. The ARE scans the input string from left to right and attempts to 
match the Match Context of a rule. 

2.4.4.4. If the ARE is able to identify a Match Context, it checks to see if the 
Pre- and Post-Contexts specified in the rule are present. 

2.4.4.4.1. If the Pre- and Post-Contexts specified in the rule match, then 
the ARE applies the rule and makes the specified change in the 
Match Context, producing the Output. 

2.4.4.4.2. The ARE then returns to the top of the rule set and attempts to 
identify a Match Context beginning with the character 
immediately following the previous Match Context. 

2.4.4.5. If no match is found, the ARE moves to the Match Context of the next 
rule. 

2.4.4.6. If no rule has fired, the default rule applies: the character output is the 
character itself. E.g., S -> S 
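
The scan described in 2.4.4.1 through 2.4.4.6 can be sketched as follows. This is a minimal illustration, not the ARE itself: the `Rule` class, the `apply_rules` function and the two sample rules are hypothetical, Python's `re` module stands in for the ARR notation, and the contexts are tested against the original input rather than against earlier output.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One rewrite rule: PRE-CONTEXT | MATCH CONTEXT | POST-CONTEXT -> OUTPUT."""
    match: str      # regular expression for the Match Context
    output: str     # replacement text for the Match Context
    pre: str = ""   # regular expression that must end just before the match
    post: str = ""  # regular expression that must start just after the match


def apply_rules(name: str, rules: list[Rule]) -> str:
    """Scan left to right; the first rule whose three contexts match fires."""
    text = " " + name  # leading white space serves as a boundary marker
    out = []
    pos = 0
    while pos < len(text):
        for rule in rules:
            m = re.compile(rule.match).match(text, pos)
            if m is None or m.end() == pos:
                continue  # no Match Context here: try the next rule
            # Contexts are tested against the original input, never the output.
            if rule.pre and not re.search(f"(?:{rule.pre})$", text[:pos]):
                continue
            if rule.post and not re.compile(rule.post).match(text, m.end()):
                continue
            out.append(rule.output)
            pos = m.end()  # resume right after the consumed Match Context
            break
        else:
            out.append(text[pos])  # default rule: the character maps to itself
            pos += 1
    return "".join(out).strip()


# Hypothetical rules for illustration only; they are not taken from the ARR.
SAMPLE_RULES = [
    Rule(match="MM", output="M"),
    Rule(match="E", output="A", pre="M", post="D"),
]
```

With these sample rules, `apply_rules("MUHAMMED", SAMPLE_RULES)` collapses the doubled M and rewrites the E between M and D, while a name that matches no rule passes through unchanged.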

2.4.4.7. Arabic Regularization Rules (ARR) 

2.4.4.7.1. The ARRs are written as regular expressions and use, for the 
most part, regular expression notation. See Section 3.3 for ARR 
details. 

2.4.4.7.2. The ARRs use defined metasymbols. 

2.4.4.7.3. The ARE must be able to recognize all regular expression 
notation and metasymbols of the ARR and implement them. 

2.4.4.7.4. ARRs have the format: 

Figure 1: Format: Arabic Regularization Rule 

| PRE-CONTEXT | MATCH CONTEXT | POST-CONTEXT | -> | OUTPUT | 

2.4.4.7.5. Rule Ordering 

2.4.4.7.5.1. Rule ordering is important because the first rule for 
which the ARE finds a Match Context (and the Pre- and 
Post-Contexts match as well) will apply. Once the rule 
has applied to the Match Context, no other rules will 
apply to it: No following rule will then fire on that 
Match Context. 

2.4.4.7.5.1.1. The rules must have internal ordering 
based on the Match Context only. 

2.4.4.7.5.1.2. Rules may intrude in the ordering of the 
Match Context if they are applicable to another 
phenomenon. 

2.4.4.7.5.1.2.1. For example, an MI -> NE rule 
will need to precede an M -> N 
rule or the MI -> NE will never 
apply. 

2.4.4.7.5.1.2.2. A rule that applies to a Match 
Context where A -> AW could 
intervene between the "M" rules 
and have no effect on the order of 
application of the "M" rules. 

2.4.4.7.5.1.2.3. In general, rules with longer 
character strings in the Match 
Context need to precede rules with 
shorter character strings. 

2.4.4.7.5.1.2.4. Care must be taken when rules 
have symbols for optional 
characters, for example. The 
ordering of an M?L rule (a rule that 
can apply to ML or L) must be 
carefully placed with respect to 
other rules that apply to M and L. 
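
The effect of rule ordering can be seen in a small sketch that uses the MI -> NE and M -> N rules named above. The `rewrite` function is a hypothetical, simplified stand-in for the ARE: it treats each Match Context as a literal string and ignores Pre- and Post-Contexts.

```python
def rewrite(text: str, rules: list[tuple[str, str]]) -> str:
    """At each position, fire the first rule (in list order) that matches."""
    out = []
    pos = 0
    while pos < len(text):
        for match, output in rules:
            if text.startswith(match, pos):
                out.append(output)
                pos += len(match)  # the Match Context is consumed
                break
        else:
            out.append(text[pos])  # no rule fired: copy the character
            pos += 1
    return "".join(out)


LONGER_FIRST = [("MI", "NE"), ("M", "N")]
SHORTER_FIRST = [("M", "N"), ("MI", "NE")]
```

On the input MIM, `LONGER_FIRST` lets MI -> NE fire before the single M rule, whereas with `SHORTER_FIRST` the M rule consumes the first M and the MI rule never applies.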

2.4.4.7.5.2. The Output of one rule does not form the input to 
another rule. 

2.4.4.7.5.2.1. Only one rule applies to a character or 
character string that matches a Match Context. 

2.4.4.7.5.2.2. The first rule that matches all three 
contexts in the Match, Pre- and Post- Context 
order is applied. 

2.4.4.7.5.2.3. The Output of a rule cannot then be 
changed. 

2.4.4.7.5.2.4. Rules must be written so that they stand 
alone: rules are not interdependent. 

2.4.4.7.5.3. If the ARE is able to match the Match Context, the 
Pre- and Post-Contexts are examined. 

2.4.4.7.5.3.1. If the Pre- and Post-Contexts both match, 
the ARE effects the change in the Match 
Context indicated in the OUTPUT. 

2.4.4.7.5.3.1.1. The next available context in the 
input string to be considered for a match 
immediately follows the previous Match Context. 

2.4.4.7.5.3.1.2. For example, if HEIMER is the 
input string and a rule applies to HEIM to make it 
GIM, the next available context for consideration is 
the E of ER (following HEIM). If an E rule is to 
apply, it can only apply to the second E, not that of 
the previous Match Context (HEIM). 

2.4.4.7.5.3.1.3. The Output of a previous rule cannot 
be the Pre- or Post-Context of a following rule. 

2.4.4.7.5.3.2. The rule is applied only if the ARE is 
successful in matching the Match Context and 
the Pre- and Post-Contexts. 

2.4.4.7.6. There is no backtracking in the ARE. 

2.4.4.7.7. The output of successful application of rule(s) by the ARE is a 
regularized Arabic form. The output of the ARE can be in any 
string form (e.g., binary, regular expression, characters). 

2.4.4.8. Subordinates 
None. 



2.5. ARABIC TAQ PROCESSOR MODULE DECOMPOSITION 

2.5.1. Identification 

This function is known as the Arabic TAQ Processor (ATP). 

2.5.2. Type 

The ATP is a function that identifies titles (T), affixes (A) and qualifiers (Q), as 
specified in the Arabic TAQ Data Store, and implements the disposition 
indicated in that table. 

2.5.3. Purpose 

Arabic names frequently contain peripheral name elements, such as ABDEL, 
ABU, AL. Matching on these segments is not generally useful; the name 
segments with information value are the name stems, RAHMAN, SAYED, 
HANAWI. Removal of or disregard for the peripheral name elements allows 
more emphasis to be placed on the name stems. 

2.5.4. Function 

2.5.4.1. The ATP will access the Arabic TAQ Data Store (ATD) to identify 
titles (e.g., USTAAZ), affixes (e.g., EL DIN) and qualifiers (Q) that occur 
in the regularized name. 

2.5.4.2. The ATP will tag as a T, P, I, S, or Q any such segments found in the 
name, as specified in the ATD. 

2.5.4.2.1. The ATP will scan the full SN or GN field for any TAQ 
segments. 

2.5.4.2.2. If the ATP identifies a segment, it will tag the segment with 
the ID_NO and disposition, as indicated in the ATD. 

2.5.4.2.3. If the following segment is also a TAQ segment, it will tag the 
segment with the ID_NO and disposition, as indicated in the 
ATD. 

2.5.4.2.4. This will continue until all consecutive TAQ segments have 
been tagged. 

2.5.4.2.5. When the ATP encounters a following segment that is not a 
TAQ segment, it will treat that segment as a stem. 

2.5.4.2.5.1. Each TAQ segment identified up to that point will be 
given the TAQ_TYPE P (prefix) and each will be 
associated and stored with the following stem. 

2.5.4.2.6. The ATP will move to the next segment following the stem 
and will repeat the TAQ identification process. 

2.5.4.2.6.1. The ATP will tag all TAQ segments with the ID_NO 
and disposition. 

2.5.4.2.6.2. When the ATP encounters a stem, it will tag each 
TAQ segment (not yet associated with a stem) with the 
TAQ_TYPE P and will associate and store each TAQ 
segment with the following stem. 

2.5.4.2.7. If the ATP encounters a TAQ segment (or segments) that has 
no following stem, it will access the ATD to determine if the 
TAQ type is a Suffix (S). 

2.5.4.2.7.1. If the TAQ has a TAQ_TYPE S, the TAQ will be 
associated and stored with the preceding stem. 

2.5.4.2.7.2. The preceding stem may already have prefixal 
TAQs. 

2.5.4.2.7.3. If the TAQ type is not equal to S, the TAQ will be 
tagged a Stranded Affix. 
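
The association of TAQ segments with stems in 2.5.4.2 might be sketched as below. The `ATD` entries, their ID_NO values, types and dispositions are hypothetical placeholders (the actual ATD is specified in Section 3.4), and each TAQ is assumed to be a single segment. Treating a suffix with no preceding stem as stranded is likewise an assumption.

```python
from dataclasses import dataclass, field

# Hypothetical ATD excerpt: segment -> (ID_NO, TAQ_TYPE, disposition).
ATD = {
    "USTAAZ": (1, "T", "DELETE"),
    "ABU": (2, "P", "DISREGARD"),
    "AL": (3, "P", "DISREGARD"),
    "DIN": (4, "S", "DISREGARD"),
}


@dataclass
class Stem:
    text: str
    prefixes: list = field(default_factory=list)  # TAQs stored as TAQ_TYPE P
    suffixes: list = field(default_factory=list)  # TAQs stored as TAQ_TYPE S


def tag_taqs(segments):
    """Return (stems, stranded) for one regularized SN or GN field."""
    stems, stranded, pending = [], [], []
    for seg in segments:
        if seg in ATD:
            pending.append(seg)  # consecutive TAQ run, not yet tied to a stem
        else:
            stems.append(Stem(seg, prefixes=pending))  # run becomes prefixes
            pending = []
    for seg in pending:  # TAQ segments with no following stem
        if ATD[seg][1] == "S" and stems:
            stems[-1].suffixes.append(seg)  # suffix joins the preceding stem
        else:
            stranded.append(seg)  # Stranded Affix
    return stems, stranded
```

For example, `tag_taqs(["ABU", "AL", "HANAWI", "DIN"])` attaches ABU and AL to HANAWI as prefixes and DIN as a suffix, while a trailing AL with no following stem is reported as stranded.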

2.5.4.3. The ATP will process any TAQ segments identified according to the 
treatment indicated in the ATD. (See Section 3.4.) 

2.5.4.3.1. Treatment options include DELETE and DISREGARD. 

2.5.4.3.2. DELETE means that the segment is completely disregarded in 
the remainder of the name search process and contributes 
marginal information to the filtering process. (N.B. The segment 
is not deleted from the record.) 

2.5.4.3.3. DISREGARD means that the segment is disregarded in the 
remainder of the name search process but contributes to the 
evaluation of the name in the filtering processes. 

2.5.4.4. TAQ Tag 

2.5.4.4.1. The TAQ tag will reference the ID_NO of the TAQ. 

2.5.4.4.2. The TAQ tag will reference the indicated treatment of the TAQ 
segment. 

2.5.4.4.3. The TAQ tag will be associated with a name stem, unless 
marked as a Stranded Affix. 

2.5.4.4.4. Surnames containing the prefix AL (e.g., AL IDRISI) will be 
specially marked. 

2.5.4.5. The TAQ tag will assist in the sorting of records (see Section 2.12, the 
Arabic Filter and Sorter (AFS)). 

2.5.5. Subordinates 
None. 

2.6. ARABIC DATA EVALUATOR MODULE DECOMPOSITION 

2.6.1. Identification 

This function is known as the Arabic Data Evaluator (ADE). 

2.6.2. Type 

The ADE is a function that "corrects" data entry errors by generating one or 
more alias records. 

2.6.3. Purpose 

Arabic names are conventionally a Given Name followed by a string of 
(usually paternal) relationships, elements of which are routinely deleted. Some 
data entry operators have apparently attempted to capture the fact that the 
Arabic name is closer to a single name string by entering XXX into the Given 
Name field (cf., presumably, the XXX permitted in the COB or DOB fields). 
Because XXX is not a conventional representation of any Given Name 
information, it interferes with the name search and will be altered. 

2.6.4. Function 

The ADE will determine if the leftmost Given Name segment (only) is XXX. 
If so, it will change that string to FNU and generate an alias add record or 
query. 
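
A minimal sketch of this check follows. The function name and the representation of the Given Name field as a list of segments are assumptions made for illustration.

```python
def fnu_alias(gn_segments):
    """Return alias GN segments with a leading XXX replaced by FNU, else None."""
    if gn_segments and gn_segments[0] == "XXX":
        return ["FNU"] + list(gn_segments[1:])
    return None  # only the leftmost Given Name segment is examined
```

An XXX that appears anywhere other than the leftmost Given Name position produces no alias.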

2.6.5. Subordinates 
None. 

2.7. ARABIC SEGMENT POSITIONER MODULE DECOMPOSITION 

2.7.1. Identification 

This function is known as the Arabic Segment Positioner (ASP). 

2.7.2. Type 

The ASP is a processing module that operates on the preprocessed, regularized 
name and moves name segments from the Given Name field into the Surname 
field. Alias records will be produced to reflect format changes. 

2.7.3. Purpose 

Arabic names are made up of a Given Name (GN) (usually one, although 
compound names may occur: SAMIR; MUHAMAD ALI) and a string of 
familial (paternal) relations following the GN (ABD EL KADEER SAMIR 
ABD EL LATIF). This string is generally made up of GNs which are taken 
from the father, grandfather and other relations. In most cases, none of these 
segments can be identified as a Surname, i.e., a name used by every member of the 
family to signal family membership. The full string following the GN 
provides crucial information about the individual that is lost if it is sometimes 
in the GN field and sometimes in the SN field. So, positioning names that 
occur after the first GN in the SN field provides the opportunity for better 
matches. 

2.7.4. Function 

2.7.4.1. The ASP will move all segments to the right of the leftmost GN 
(GN1) (in the preprocessed, regularized name) to the leftmost SN 
position, preserving the order of the moved segments. 

Figure 2: Movement of Segments into SN Field 

| FARUK, MUHAMAD SAMIR ABDULA | -> | SAMIR ABDULA FARUK, MUHAMAD | 
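
The movement shown in Figure 2 can be sketched in a few lines. The `reposition` function is hypothetical and assumes the name arrives as a single "SURNAME, GIVEN NAMES" string with segments separated by white space.

```python
def reposition(name: str) -> str:
    """Move every GN segment after the leftmost one to the front of the SN field."""
    surname, _, given = name.partition(",")
    sn = surname.split()
    gn = given.split()
    moved = gn[1:]  # segments to the right of the leftmost GN, order preserved
    return " ".join(moved + sn) + ", " + " ".join(gn[:1])
```

A name with a single Given Name segment is returned unchanged, since there is nothing to move.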

2.7.4.2. The leftmost Given Name (GN1) segment will not be moved into the 
Surname field except 

a) if there is one and only one GN segment and 

b) if there is one and only one SN segment which has been tagged as 
having the prefix AL, 

c) then the ASP will generate an alias record with the SN and GN 
inverted. 



Figure 3: Inversion of SN and GN with AL in the SN Field 

SURNAME        GN1 
(AL) IDRISI    YUSEF 
YUSEF          (AL) IDRISI 



2.7.4.3. The GN1 may be a name segment, an initial, or FNU. (See Section 
2.9, the Arabic Search Engine (ASE) for additional information.) 
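
The exception in 2.7.4.2 (illustrated in Figure 3) can be expressed as a small predicate. The function and parameter names are hypothetical; the AL tag is assumed to be available as a boolean produced by the ATP.

```python
def al_inversion_alias(sn_segments, gn_segments, sn_tagged_al):
    """Return (surname, given) for the inverted alias, or None if not applicable."""
    if len(gn_segments) == 1 and len(sn_segments) == 1 and sn_tagged_al:
        return list(gn_segments), list(sn_segments)  # SN and GN swap fields
    return None
```

The alias is generated only when all three conditions hold; a second GN or SN segment, or a surname without the AL tag, yields no alias.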

2.7.5. Subordinates 
None. 

2.8. ARABIC GENDER IDENTIFIER MODULE DECOMPOSITION 

2.8.1. Identification 

This function is known as the Arabic Gender Identifier (AGI). 

2.8.2. Type 

2.8.2.1. The AGI is a function that will apply after the ANR has produced a 
regularized representation of the input name and the Arabic Segment 
Positioner (ASP) has moved all GN segments other than the GN1 into the 
SN field. 

2.8.2.2. For the AGI to derive record gender, the data input operator will need 
to supply gender for each record added to the database and for each query 
during the data entry process. 

2.8.3. Purpose 

2.8.3.1. Crossed-gender records are of little value to the system user. 

2.8.3.2. Arabic gender is reliably predictable from the GN1. 

2.8.3.3. Records that have crossed gender will receive lowered match values 
during the filtering and sorting process. 

2.8.3.4. Record gender requires gender validation from two sources: gender 
received during the data entry process and predictable gender associated 
with Arabic names. 

2.8.3.5. Record gender reduces the chance of associating gender with a name 
that may be misspelled. 

2.8.4. Function 

2.8.4.1. The AGI will derive the record gender for all record adds and queries. 

2.8.4.1.1. For each query and add name, the AGI will derive record 
gender from user-supplied gender input and from the gender that 
has been assigned to the GN1. 

2.8.4.1.2. A minimum of two gender indicators is required for a gender 
assignment of M or F. 

2.8.4.2. For record adds, gender received as input from the data entry process 
will be stored with the record. 

2.8.4.3. For record queries, the user will input the gender of the applicant at 
query time. 

2.8.4.4. For both adds and queries, the AGI will access the Arabic Name Type 
Data Store (ANT) and will assign the gender value to the GN1 segment, 
as indicated in the ANT (GENDER). (See Section 3.5.) 

2.8.4.4.1. If the name is present in the ANT, the gender associated with 
the name segment will be compared to the data entry gender. 

2.8.4.4.1.1. If the gender indicators match, the matching value 
will become the record gender. 

2.8.4.4.1.2. If the gender indicators do not match, the record 
gender will be Unknown (U). 

2.8.4.4.2. If the name is not present in the ANT, the record gender will 
be marked as Unknown (U). 
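
The derivation of record gender in 2.8.4 reduces to a lookup and a comparison, sketched below. `ANT_GENDER` is a hypothetical two-row excerpt standing in for the GENDER column of the ANT (Section 3.5).

```python
# Hypothetical ANT excerpt: regularized GN1 -> gender predicted from the name.
ANT_GENDER = {"MUHAMAD": "M", "GAMILA": "F"}


def record_gender(gn1, input_gender):
    """Return M or F only when both gender indicators agree, otherwise U."""
    name_gender = ANT_GENDER.get(gn1)
    if name_gender is None:
        return "U"  # name not present in the ANT
    if name_gender == input_gender:
        return name_gender  # two matching indicators
    return "U"  # indicators disagree
```

Both the data entry gender and the name-based gender must be available and equal before M or F is assigned.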

2.8.5. Subordinates 
None. 

2.9. ARABIC SEARCH ENGINE MODULE DECOMPOSITION 

2.9.1. Identification 

This module is known as the Arabic Search Engine (ASE). 

2.9.2. Type 

The ASE is a processing module that accepts the output of the Arabic 
Preprocessor (APP), generates retrieval keys through the Arabic Key Generator, 
retrieves candidate records from the database based on the keys and submits 
those candidate records to the Arabic Filter and Sorter (AFS) Module. 

2.9.3. Purpose 

The regularized, repositioned names generated by the APP will be, in general, a 
representation of the canonical form of the Arabic name. The search process 
will benefit from focus on the canonical form of the Arabic name. 

2.9.4. Function 

The ASE will retrieve records from the database whose stored keys match the 
keys generated for the query record. 

2.9.5. Subordinates 

The ASE has one subordinate module: 
• Arabic Key Generator 

2.10. ARABIC KEY GENERATOR MODULE DECOMPOSITION 

2.10.1. Identification 

This function is known as the Arabic Key Generator (AKG). 

2.10.2. Type 

The AKG is a function that will form keys from the GN1 and each SN 
segment of the preprocessed, regularized name for both record adds and 
queries. 

2.10.3. Purpose 

In order to reduce the number of records that must be compared by the Arabic 
Filter and Sorter Module, it is desirable to subset the Arabic database. (About 
500,000 records have qualified as Arabic through the ANI name typing 
process and it is assumed that this number will continue to represent the 
approximate size of an Arabic database.) One mechanism for achieving a 
subset is to generate keys for the input name. The Arabic keys are motivated 
by the nature of the Arabic name and are centered around the most stable 
name segment in the Arabic name, the GN1. 

ANA-E    Language Analysis Systems, Inc.    20    03/19/98 

2.10.3.1. The Arabic keys replace the compressed-name keys produced for 
Legacy ANA, which have severe limitations for retrieving both 
predictable and unpredictable variants of the regularized Arabic names. 

2.10.3.2. For record adds, all keys will be stored with the source record. 

2.10.3.3. Keys will be generated for each SN segment (moved or resident) and 
for each GN1. 

2.10.3.3.1. For record adds, more keys will be generated for HF name 
segments than for LF segments. 

2.10.3.3.2. Search keys will be a combination of SN keys and GN1 keys. 

2.10.4. Function 

2.10.4.1. The AKG will form search keys from a combination of keys for each 
regularized segment in the Surname field and the regularized GN1. 

2.10.4.2. Initials 

All names that contain the same first character as the initial will qualify 
for retrieval on an initial. 

2.10.4.3. FNU 

All GN1 names qualify for retrieval with a GN1 of FNU (First Name 
Unknown). 

2.10.4.4. Search Keys 

2.10.4.4.1. The AKG will generate a set of Search Keys for each input 
name by conjoining each GN1 key with each SN key of the 
regularized, repositioned input name. 

2.10.4.4.2. All search keys generated for an add will be stored with the 
record add and associated with the regularized, repositioned form 
of the name. 
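
Conjoining GN1 keys with SN keys amounts to a cross product, as sketched here. The representation of a Search Key as a `(GN1 key, SN key)` tuple, the function names and the sample keys are assumptions; candidate retrieval is shown as plain equality between stored and query keys.

```python
from itertools import product


def search_keys(gn1_keys, sn_keys_by_segment):
    """Pair every GN1 key with every key of every SN segment."""
    sn_keys = [key for keys in sn_keys_by_segment for key in keys]
    return sorted(set(product(gn1_keys, sn_keys)))


def is_candidate(query_keys, stored_keys):
    """A record is a candidate if it shares at least one Search Key with the query."""
    return not set(query_keys).isdisjoint(stored_keys)
```

For a GN1 with one key and two SN segments carrying one and two keys respectively, three Search Keys result; a stored record sharing any one of them would be passed on as a candidate.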

2.10.4.5. Generating Keys 

2.10.4.6. The AKG will produce two categories of keys: 

1. Single-Part Key (SI): a key formed from the single name segment 
(SN or GN1). All Single-Part Keys will be used to form the Search 
Keys. There are three kinds of Single-Part Key: 

• Primary Key (PK): a key formed on a single name segment (SN 
or GN1) and used to define the set of keys for that segment; 

• Wild-Card Key (WK): a key based on the Primary Key that 
contains wild-card characters; 

• Special Key (SP): a key formed on a single name segment and 
intended to handle specific variation in the regularized name. 

2. Search Key (SK): a multipart key that will be stored and used for 
retrieval, consisting of a combination of the keys associated with 
every SN segment and those associated with the GN1. 

2.10.4.7. Single-Part Key (SI) 

1. The Primary Key (PK) 

• The PK is formed from one name segment. 

• The PK has a maximum of three characters. 

• The PK has the form CCC or CC or C, where C represents any 
consonant (except in the leftmost position where C may be a 
vowel). 

• The PK is formed from the leftmost character (vowel or 
consonant) of the regularized segment and the following two 
consonants (including H, Y and W). If fewer than two additional 
consonants are available, then the PK may be shorter. 

2. The Wild-Card Key (WK) 

• The WK is formed from the Primary Key. 

• The WKs will have the forms *CC, C*C, CC*, where * represents 
any consonant, except in the leftmost position where it may 
represent a vowel. 

• The WK will have the forms C* and *C with segments that have 
only two candidate characters. 

• A WK will not be formed from a Primary Key with only 1 
component (i.e., C). 

Figure 4: Example: Formation of Primary and Wild-Card Keys 

SEGMENT    PRIMARY KEY    WILD-CARD KEYS 
GAMILA     GML            *ML, G*L, GM* 
ABASI      ABS            *BS, A*S, AB* 
SAID       SD             *D, S* 
DAI        D              none 
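
The formation of Primary and Wild-Card Keys shown in Figure 4 can be sketched as follows. The function names are hypothetical, and the input is assumed to be an upper-case regularized segment in which A, E, I, O and U are the only vowels (H, Y and W counting as consonants).

```python
VOWELS = set("AEIOU")  # H, Y and W count as consonants


def primary_key(segment: str) -> str:
    """Leftmost character plus up to two following consonants."""
    consonants = [c for c in segment[1:] if c.isalpha() and c not in VOWELS]
    return segment[:1] + "".join(consonants[:2])


def wild_card_keys(pk: str) -> list[str]:
    """Replace each position of the Primary Key in turn with '*'."""
    if len(pk) < 2:
        return []  # a one-character Primary Key yields no Wild-Card Keys
    return [pk[:i] + "*" + pk[i + 1:] for i in range(len(pk))]
```

Applied to the segments of Figure 4, `primary_key` yields GML, ABS, SD and D, and `wild_card_keys` yields the corresponding sets (none for D).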







2.10.4.7.1. Special Key (SP) 

2.10.4.7.2. The AKG will produce Special Keys (SP) to accommodate 
situations that cannot be accommodated by the ARR. 

2.10.4.7.3. The AKG will generate the Special Keys in addition to the 
Primary and Wild-Card Keys. 

2.10.4.7.3.1. K-Key 

2.10.4.7.3.2. The character K alternates with null in many 
Arabic names, resulting in the potential overlap of many 
names with the K names. 

2.10.4.7.3.3. This phenomenon is not readily handled by the 
ARR, so names with a K require a Special Key. 

2.10.4.7.3.4. The K-Keys are formed in the following way: 

2.10.4.7.3.4.1. For any segment with K in initial 
position, the following keys are produced: 
*CC where * represents any character or 
nothing. (This key is equivalent to a WK 
produced for this name.) 

2.10.4.7.3.4.2. For any segment with K in medial 
position, the following keys are produced: 

1. CkC, where k represents the character "k"; and 

2. CCC, where k has been deleted from the 
name string and the CCC represents the 
three leftmost consonants that remain. 

2.10.4.7.3.4.3. For any segment with K in final position, 
the following keys are produced: 

1. CCk, where k represents the character "k" 
and 

2. CC, where k has been deleted from the 
name string and the CC represents the two 
leftmost consonants (or vowel in first 
position) that remain. 

2.10.4.7.3.5. The standard set of Wild-Card Keys will also be 
produced from the Primary Key for K-names. 

2.10.4.7.3.6. Record Add/Query: The AKG will generate and 
store all K-Keys with the segment. 



Figure 5: Example: Formation of K-Keys 

NAME SEGMENT / VARIANT    PRIMARY KEY    K-KEYS    WILD-CARD KEYS 
KARSCH                    KRS                      *RS, K*S, KR* 
ARSCH                     ARS                      *RS, A*S, AR* 
MUKBEL                    MKB            MBL       *KB, M*B, MK* 
MUBEL                     MBL                      *BL, M*L, MB* 
FARUK                     FRK            FR        *RK, F*K, FR* 
FARU                      FR             FR        *R, F* 
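One possible reading of the K-Key rules is sketched below in Python. It is an assumption-laden illustration rather than the AKG logic: the names primary_key and k_keys are hypothetical, every K after the first position is treated as "medial or final", and the K-retaining key is emitted only when the K actually falls inside the Primary Key.

```python
# Hypothetical sketch of K-Key formation (one reading of the rules above).
CONSONANTS = set("BCDFGHJKLMNPQRSTVWXYZ")


def primary_key(segment: str) -> str:
    """Leftmost character plus up to two of the consonants that follow it."""
    seg = segment.upper()
    if not seg:
        return ""
    return seg[0] + "".join([ch for ch in seg[1:] if ch in CONSONANTS][:2])


def k_keys(segment: str) -> list[str]:
    seg = segment.upper()
    pk = primary_key(seg)
    keys = []
    if seg.startswith("K") and len(pk) > 1:
        # Initial K: the *CC key, identical to the first Wild-Card Key.
        keys.append("*" + pk[1:])
    if "K" in seg[1:]:
        # Medial or final K: a key that retains K (CkC or CCk) ...
        if "K" in pk[1:]:
            keys.append(pk)
        # ... and a key formed after deleting K from the name string.
        keys.append(primary_key(seg[0] + seg[1:].replace("K", "")))
    return keys


for name in ("KARSCH", "MUKBEL", "FARUK", "MUBEL"):
    print(name, k_keys(name))
```

For MUKBEL and FARUK the K-deleted keys (MBL and FR) match the K-KEYS column of Figure 5; the K-retaining key coincides with the Primary Key in both cases.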



2.10.4.7.3.7. High Frequency Key (HK) 

2.10.4.7.3.8. The AKG will generate Special Keys for High 
Frequency segments found in the input name. 

2.10.4.7.3.9. The AKG will access the Arabic Name Type 
(ANT) Data Store to identify HF segments. (See Section 
3.5.) 

2.10.4.7.3.9.1. The ANT will contain a set of Arabic 
name types, the most frequently occurring of 
which will be specified as High Frequency 
name segments (HI_FREQ = 1 (True)). (See 
Section 3.5 for details.) 

2.10.4.7.3.9.2. The AKG will tag as HF all name 
segments in the input record that match one of 
the ARABIC_NAME_TYPE segments for 
which HI_FREQ = 1 (is True). 

2.10.4.7.3.9.3. The AKG will tag all other name 
segments as LF. 

2.10.4.7.3.10. Record Add/Query 

2.10.4.7.3.11. The AKG will generate and store the Primary Key 
for any segment that has been tagged as a HF name 
segment. 

2.10.4.7.3.12. Record Add 

2.10.4.7.3.13. The AKG will generate and store all appropriate 
Wild-Card Keys for any segment that has been tagged as 
a HF name segment. 



Figure 6: Example: Primary Key as HF Key 

HF SEGMENT    PRIMARY KEY    WILD-CARD KEYS 
MUHAMAD       MHM            *HM, M*M, MH* 
AHMED         AHM            *HM, A*M, AH* 
ALI           AL             *L, A* 



2.10.4.7.4. Search Keys (SK) 

2.10.4.7.5. The Search Key is a multipart key formed from all keys 
associated with one SN segment and all keys associated with the 
GN1: e.g., *CC + *CC, *CC + C*C, C*C + CC*, etc. 

2.10.4.7.6. The Search Keys will be the keys used for retrieval of records 
from the database. 

2.10.4.7.7. Search Key Formation 

2.10.4.7.8. To form the set of search keys that will be related to each 
input record, the AKG will combine each SN segment with the 
GN1 segment: SN1 + GN1, SN2 + GN1, etc. 

2.10.4.7.9. The AKG will determine the frequency (HF or LF) of each of 
the conjoined segments. 

2.10.4.7.10. The number and type of Search Keys will be based on the 
frequency of the name segments. 

2.10.4.7.11. The AKG will form 

• Standard Search Keys and 

• HF Search Keys. 

2.10.4.7.12. Standard Search Keys (SS) 

2.10.4.7.13. Standard Search Keys (SS) are formed for each SN and GN1 
pair. 

2.10.4.7.14. Record Add 

2.10.4.7.15. To form a set of Standard Search Keys, the AKG will 
combine each Wild-Card Key and each K-Key of each SN 
segment with each Wild-Card Key and each K-Key of the GN1. 

2.10.4.7.15.1. For example, each segment with three characters 
(CCC) will have generated three Wild-Card Keys. 

2.10.4.7.15.2. When the keys from two segments with three 
characters each are paired, there will be a total of nine 
keys. 

2.10.4.7.15.3. For segments with fewer characters, there will be 
fewer than nine keys. 

2.10.4.7.16. The AKG will generate and store these keys with the record. 



Figure 7: Example: Formation of Standard Search Keys (Record Add) 

REPOSITIONED, REGULARIZED INPUT FORMAT: AHMED BADAWI, MUHAMAD 

               GN1: MUHAMAD                STANDARD SEARCH KEYS 
SN1: AHMED     SN1+GN1: AHMED MUHAMAD      *HM*HM, *HMM*M, *HMMH*, 
                                           A*M*HM, A*MM*M, A*MMH*, 
                                           AH**HM, AH*M*M, AH*MH* 
SN2: BADAWI    SN2+GN1: BADAWI MUHAMAD     *DW*HM, *DWM*M, *DWMH*, 
                                           B*W*HM, B*WM*M, B*WMH*, 
                                           BD**HM, BD*M*M, BD*MH* 
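The pairing of SN keys with GN1 keys shown in Figure 7 amounts to a Cartesian product. A minimal Python sketch follows; the function name standard_search_keys is hypothetical, and joining the two key parts without a separator is an assumption taken from the way the keys are printed in the figure.

```python
from itertools import product


def standard_search_keys(sn_keys: list[str], gn_keys: list[str]) -> list[str]:
    """Combine every key of one SN segment with every key of the GN1.

    Each input list is assumed to hold the Wild-Card Keys and any K-Keys
    already generated for that segment.
    """
    return [sn + gn for sn, gn in product(sn_keys, gn_keys)]


# Wild-Card Keys from Figure 6 and Figure 7.
ahmed = ["*HM", "A*M", "AH*"]
badawi = ["*DW", "B*W", "BD*"]
muhamad = ["*HM", "M*M", "MH*"]

print(standard_search_keys(ahmed, muhamad))
print(standard_search_keys(badawi, muhamad))
```

Two three-key segments give nine Standard Search Keys; a segment with only two Wild-Card Keys, such as SAID, would give six.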



2.10.4.7.17. Query 

2.10.4.7.18. If either segment of the SN + GN1 pair has been tagged as 
LF, the AKG will generate the Standard Search Keys. 

2.10.4.7.19. To form a set of Standard Search Keys, the AKG will 
combine each Wild-Card Key and K-Key of each SN segment 
with each Wild-Card Key and K-Key of the GN1. (See Section 
2.10.4.7.15.) 

2.10.4.7.20. HF Search Keys (HS) 

2.10.4.7.21. Query 

2.10.4.7.22. If both segments (the SN and the GN1) of the conjoined pair 
have been tagged as HF segments, the AKG will form one Search 
Key from the Primary Key of the SN + the Primary Key of the 
GN1. 

2.10.4.7.23. The High Frequency Search Key will be the only Search 
Key used for a query on the SN + GN1 pair when both segments 
are HF segments. 

2.10.4.7.24. Record Add 

2.10.4.7.25. If both segments (the SN and the GN1) that have been 
conjoined have been tagged as HF segments, the AKG will form 
one Search Key from the Primary Key of the SN + the Primary 
Key of the GN1. 

2.10.4.7.26. The HS will be stored with a record add. 

2.10.4.7.27. The HS will be a key stored in addition to the Standard 
Search Keys for the record. 

Figure 8: Example: HF Search Keys 

REPOSITIONED, REGULARIZED INPUT FORMAT: AHMED ALI, MUHAMAD 
GN1: MUHAMAD (HF) 

                                               HF SEARCH KEYS 
SN1: AHMED (HF)    SN1+GN1: AHMED MUHAMAD      AHMMHM 
SN2: ALI (HF)      SN2+GN1: ALI MUHAMAD        ALMHM 
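The choice between Standard and HF Search Keys at query time can be sketched as below. This is an illustrative Python fragment, not the AKG code: the Segment record and the function query_search_keys are hypothetical names, and the keys are assumed to have been generated beforehand.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Segment:
    """Hypothetical holder for the keys generated for one name segment."""
    primary_key: str
    keys: list[str] = field(default_factory=list)  # Wild-Card Keys and K-Keys
    high_frequency: bool = False                   # HF tag from the ANT lookup


def query_search_keys(sn: Segment, gn1: Segment) -> list[str]:
    """Return the Search Keys used to query one SN + GN1 pair."""
    if sn.high_frequency and gn1.high_frequency:
        # Both segments HF: a single key built from the two Primary Keys.
        return [sn.primary_key + gn1.primary_key]
    # Otherwise (either segment LF): the Standard Search Keys.
    return [a + b for a, b in product(sn.keys, gn1.keys)]


muhamad = Segment("MHM", ["*HM", "M*M", "MH*"], high_frequency=True)
ahmed = Segment("AHM", ["*HM", "A*M", "AH*"], high_frequency=True)
ali = Segment("AL", ["*L", "A*"], high_frequency=True)
badawi = Segment("BDW", ["*DW", "B*W", "BD*"], high_frequency=False)

print(query_search_keys(ahmed, muhamad))
print(query_search_keys(ali, muhamad))
print(len(query_search_keys(badawi, muhamad)))
```

Using a single key for HF pairs keeps very common name combinations from matching through the looser wild-card keys.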



2.11. Retrieval Function of the Arabic Search Engine (ASE) 

2.11.1. The Arabic Search Engine (ASE) will retrieve records from the database 
based on the following criteria: 

• An exact match of the query Search Keys and stored Search Keys and 

• Refusal Code Level and associated Year-of-Birth Range. 

2.11.2. The ASE will access the Refusal Code Level/Year-of-Birth Range 
(RLYOB) Data Store to determine the YOB range within each Refusal Level to 
search for candidate records. 

2.11.3. The ASE will retrieve the unique ID and the regularized, repositioned 
form of the record. 

2.11.3.1. Determination of the proximity by the Arabic Filter and Sorter of the 
query and database records will be based on the regularized, repositioned 
form of the record. 

2.11.3.2. The ASE will eliminate all records with the same unique ID retrieved 
during the retrieval process. 

2.11.4. Subordinates 

Arabic Key Generator. 

2.12. ARABIC FILTER AND SORTER MODULE DECOMPOSITION (AFS) 



2.12.1. Identification 

This module is known as the Arabic Filter and Sorter (AFS). 

2.12.2. Type 

2.12.2.1. The AFS is a module that accepts each regularized database record 
retrieved by the ASE and compares it to the regularized form of the query 
record. 

2.12.2.2. The AFS is constituted of two subordinate functions: 

• the Arabic Filter and 

• the Arabic Sorter. 

2.12.2.3. The AFS must follow the Arabic Search Engine (ASE). 

2.12.3. Purpose 

2.12.3.1. The set of database records that the ASE will retrieve will carry no 
assigned value relative to the query record. The AFS will evaluate each of the 
records retrieved for its proximity to a query record, will retain those that 
pass a pre-established threshold and will sort the resultant candidate list. 

2.12.3.2. The filtering process will take into account a number of factors that 
play a role in determining the relative value of Arabic names. 

2.12.4. Function 

2.12.4.1. The AFS will compare the query name and record name to determine 
a relative surname value and given name value and will generate a 
composite score for the records by accounting for Date-of-Birth, Refusal 
Level and Country-of-Birth proximity. 

2.12.4.2. Arabic Filter Function of the AFS 

2.12.4.3. The Arabic Filter and Sorter will first determine if the query record 
and prime database record (unregularized version) match exactly. 

2.12.4.3.1. The Surname, Given Name, Date-of-Birth and Country-of-Birth 
must be exact matches. 

2.12.4.3.2. If the two records match exactly, the AFS will tag the record 
as an exact match. 

2.12.4.3.3. The AFS will send the record directly to the Arabic Sorter 
Function. 

2.12.4.4. The Arabic Filter and Sorter (AFS) will accept the regularized, 
repositioned candidate records retrieved by the ASE. 

2.12.4.5. The AFS will perform a digraph comparison of the regularized, 
repositioned surname segments (stems) of the query record and each 
candidate record. 

2.12.4.6. The AFS will perform a digraph comparison of the regularized given 
name segment (stem) of the query record (GNl) and the given name 
segment (stem) of each candidate record (GNl). 

2.12.4.6.1. The score produced by the digraph comparison (DI_VAL) 
will be adjusted by values assigned to several parameters. 

2.12.4.6.2. The score assigned to the surname and to the given name, 
after the parameters have adjusted the DI_VAL, will be the 
SN_VAL and the GN_VAL. 

2.12.4.6.3. Factors that contribute to the determination of the name 
scores (SN_VAL and GN_VAL) include 

• SNTHR 

• GNTHR 

• OPVAL 

• INITSN 

• INITGN 

• TAQASN 

• TAQAGN 

• TAQXSN 

• TAQXGN 

• GNDR 

2.12.4.6.4. A final name score will be calculated for each candidate 
database record as it compares to the query record. 

2.12.4.6.4.1. A score for the SN will be calculated: SN_VAL. 

2.12.4.6.4.2. A score for the GN will be calculated: GN_VAL. 

2.12.4.6.5. To be included in the final candidate list, the SN_VAL and 
GN_VAL must each pass pre-determined SN and GN threshold 
levels (SNTHR and GNTHR). 

2.12.4.7. Surname Evaluation 

2.12.4.8. The AFS will perform a digraph comparison of each SN stem of the 
database record with each SN stem of the query record. 

2.12.4.8.1. The digraph value is determined in the following way: 

2.12.4.8.1.1. The digraphs are identified for each name stem. 

2.12.4.8.1.1.1. Each pair of adjacent alphabetic characters is 
identified: TAFIQ -> TA/AF/FI/IQ. 

2.12.4.8.1.1.2. A digraph is also formed of the initial 
boundary (#) and the first alphabetic character: 
TAFIQ -> #T. 

2.12.4.8.1.1.3. A digraph is also formed of the final 
alphabetic character and the final boundary (#): 
TAFIQ -> Q#. 

2.12.4.8.1.2. The number of shared digraphs is calculated. 

2.12.4.8.1.2.1. A digraph may match one digraph only. 

2.12.4.8.1.3. The number of shared digraphs is multiplied by 2 
and divided by the sum of the total number of digraphs in 
Comparand #1 and the total number of digraphs in 
Comparand #2. 

2.12.4.8.1.3.1. The formula is: 
2*d/(a + b), 

where d = the total number of shared 
digraphs; 

where a = the total number of digraphs in 
Comparand #1; and 

where b = the total number of digraphs in 
Comparand #2. 

2.12.4.8.1.4. The result is the Digraph Value (DI_VAL) for the 
two Comparands. 



COMPARANDS              DIGRAPHS                  SHARED DIGRAPHS    DI_VAL 
COMPARAND #1: BADIR     #B BA AD DI IR R#         BA AD DI IR R#     2*d/(a + b) = 
                        (6 total digraphs = a)    = 5 (d)            10/13 = 
COMPARAND #2: ABADIR    #A AB BA AD DI IR R#                         0.77 
                        (7 total digraphs = b) 
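The digraph calculation can be illustrated with a short Python sketch. The function names digraphs and di_val are hypothetical, and treating the rule that a digraph may match one digraph only as a multiset intersection is an assumption about how repeated digraphs are counted.

```python
from collections import Counter


def digraphs(stem: str) -> list[str]:
    """All adjacent character pairs, including the # boundary digraphs."""
    padded = "#" + stem.upper() + "#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def di_val(stem_1: str, stem_2: str) -> float:
    """Digraph Value: 2*d / (a + b)."""
    first = Counter(digraphs(stem_1))
    second = Counter(digraphs(stem_2))
    # Multiset intersection: each digraph occurrence is matched at most once.
    shared = sum((first & second).values())
    return 2 * shared / (sum(first.values()) + sum(second.values()))


print(digraphs("TAFIQ"))
print(round(di_val("BADIR", "ABADIR"), 2))
```

For BADIR and ABADIR this gives d = 5, a = 6 and b = 7, that is 10/13, in line with the table above.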
2.12.4.9. This process is performed for each pair of Comparands in the 
database and query SN (SN1/SN1, SN1/SN2, SN1/SN3, SN2/SN2, etc.). 

2.12.4.10. Each DI_VAL is adjusted according to parameter values in the 
Filter Parameter Data Store (see Section 3.6 for details). 

2.12.4.11. The AFS will determine if the appropriate parameter conditions are 
met. 

2.12.4.12. If the appropriate conditions are present, the DI_VAL will be 
multiplied by the value assigned to the parameter and the relative score of 
the two Comparands will be lowered. 

2.12.4.13. Parameter Conditions 

2.12.4.13.1. INITSN: Surname Initial 

2.12.4.13.1.1. Definition: A SN segment is a single character 
and it matches the first character of the other comparand. 

2.12.4.13.1.2. Action: Assign the INITSN value to the 
comparison value (i.e., do not calculate the DI_VAL). 

2.12.4.13.2. OPSN: Out-of-Place Surname 

2.12.4.13.2.1. Definition: A SN segment that is not in the same 
relative position in the SN string in both the database 
and query records. 

2.12.4.13.2.2. Action: Multiply the DI_VAL by the OPSN 
value. (See Figures 10 and 11.) 

2.12.4.13.3. TAQ Filter 

2.12.4.13.4. All TAQ tags (ID_NO, disposition, TAQ_TYPE and 
associated SN stem) will be retrieved with the database record. 

2.12.4.13.5. The AFS will evaluate any TAQs associated with the SN 
segments being evaluated, except Stranded Affixes (see Section 
2.5.4.2.7.3). 

2.12.4.13.5.1. A Stranded Affix will not play a role in the prefix 
comparison. 

2.12.4.13.6. Single TAQs 

2.12.4.13.7. Missing TAQs 

2.12.4.13.8. TAQASN: Absent TAQ Value 

2.12.4.13.8.1. Definition 1: One of the two comparands has a 
TAQ tag, the other does not. 

2.12.4.13.8.2. Definition 2: Both SN segments have a single 
TAQ tag; one is a TAQ DELETE, the other a TAQ 
DISREGARD. 

2.12.4.13.8.3. Action: Multiply the DI_VAL by the TAQASN 
value. (See Figures 12 and 22.) 

2.12.4.13.9. TAQ DELETE 

2.12.4.13.9.1. If the TAQ DELETE tags refer to the same TAQ 
segment, the DI_VAL will be unchanged. 

2.12.4.13.9.2. If the TAQ DELETE tags refer to different TAQ 
DELETE segments, multiply the DI_VAL by the 
TAQXSN value. (See Figure 22.) 

2.12.4.13.10. TAQ DISREGARD Processing 

2.12.4.13.10.1. The AFS will access the TAQ Filter Data Store 
(TF) to process SN TAQ segments that have been tagged 
as DISREGARD. 

2.12.4.13.10.2. Definition: The AFS will access the TAQ Filter 
Data Store (TF) to process records if they both contain 
SN TAQ segments that have been tagged as 
DISREGARD. 

2.12.4.13.10.3. Action 1: The AFS will assign TAQDIS#1 to 
the TAQ DISREGARD segment for the database SN 
segment and TAQDIS#2 to the TAQ DISREGARD 
segment for the query SN segment. 

2.12.4.13.10.4. Action 2: If the two TAQ DISREGARD 
segments match, the DI_VAL will remain unchanged. 

2.12.4.13.10.5. Action 3: If the two TAQ DISREGARD 
segments do not match, the AFS will identify the 
TF_VALUE for the pair in the TF. (See Figure 24.) 

2.12.4.13.10.5.1. The AFS will multiply the DI_VAL by 
the TF_VALUE for the pair. 

2.12.4.13.11. Multipart TAQs 

2.12.4.13.11.1. Definition: If at least one SN comparand has 
multipart TAQ tags (they may be all DISREGARD, all 
DELETE, or mixed DISREGARD/DELETE), the AFS 
will perform the following analyses. 

2.12.4.13.11.2. Action: If all TAQs match, the AFS will make no 
change in the DI_VAL. 

2.12.4.13.11.3. TAQ DELETES 

2.12.4.13.11.3.1. Definition: All DELETE tags 

2.12.4.13.11.3.2. Action 1: If any DELETE TAQ 
matches, the AFS applies no change. 

2.12.4.13.11.3.3. Action 2: If no DELETE TAQs match, 
multiply the DI_VAL by the TAQXSN value. 

2.12.4.13.11.4. TAQ DISREGARDS 

2.12.4.13.11.4.1. Definition: All DISREGARD tags 

2.12.4.13.11.4.2. Action 1: If any TAQ DISREGARD 
segment matches, the AFS will make no 
change in the DI_VAL. 

2.12.4.13.11.4.3. Action 2: If no TAQ DISREGARD 
segments match, the AFS will identify the 
highest match value from the TF (TF_VALUE) 
and multiply the DI_VAL by that value. (See 
Figures 23 and 24.) 

2.12.4.13.11.5. TAQ DISREGARD and DELETES 

2.12.4.13.11.5.1. Definition: Mixed 
DISREGARD/DELETE tags 

2.12.4.13.11.5.2. Action 1: If DISREGARD segments 
are present in both comparands and there is any 
match among the DISREGARD segments, the 
AFS will make no change in the DI_VAL. 

2.12.4.13.11.5.3. Action 2: If DISREGARD segments 
are present in both comparands and there is no 
match among the DISREGARD segments, the 
AFS will determine the highest match value 
from the TF for any DISREGARD tags and 
multiply the DI_VAL by that value. (That is, 
ignore any DELETE tags.) 

2.12.4.13.11.5.4. Action 3: If a DISREGARD segment 
is in one comparand and not the other and the 
two comparands have at least one DELETE tag 
that matches, the AFS will make no change in 
the DI_VAL. 

2.12.4.13.11.5.5. Action 4: If a DISREGARD segment 
is in one comparand and not the other and the 
two comparands have DELETE tags that do 
not match, multiply the DI_VAL by the 
TAQXSN. (See Figure 22.) 

2.12.4.14. After all evaluations have been performed, the AFS will choose the 
highest score for each name segment. 

2.12.4.14.1. The highest score for both the row and column must be 
chosen. 

2.12.4.14.2. Only one score per row and column is permitted. 

2.12.4.14.3. If two scores are equal, only one is chosen. 

Figure 10: Example 1: Digraph Evaluation: Equal Number of SN Segments; Digraph 
Variants BADAWI/BEDAWI 

          AHMED    ALI     BADAWI 
AHMED     1.00     0.20    0.00 
ALI       0.20     1.00    0.18 
BEDAWI    0.15     0.18    0.71 



Figure 11: Example 2: Digraph Evaluation: Different Number of SN Segments; OPSN 

          AHMED    ALI     BADAWI 
AHMED     1.00     0.20    0.00 
BEDAWI    0.15     0.18    0.61 



Figure 12: Example 3: Digraph Evaluation: Same Number of SN Segments; TAQ Tag 
Present on One SN 

          (ABU) AHMED    SALIM    SAYED 
AHMED     0.90           0.00     0.28 
SAID      0.00           0.36     0.47 
AKBAR     0.16           0.00     0.00 



Figure 13: Example 4: Digraph Evaluation: Same Number of SN Segments; Different 
TAQ Tags 

               (ABU) AHMED    SALIM    SAYED 
(BIN) AHMED    0.50           0.00     0.28 
SAID           0.00           0.36     0.47 
AKBAR          0.16           0.00     0.00 



2.12.4.15. The AFS will sum the values chosen from the comparison matrix 
and will divide by the number of values chosen to produce the SN_VAL. 

2.12.4.15.1. In Example 1, (1.00 + 1.00 + 0.61)/3 = 0.87 

2.12.4.15.2. In Example 2, (1.00 + 0.61)/2 = 0.81 

2.12.4.15.3. In Example 3, (0.90 + 0.47 + 0.00)/3 = 0.46 

2.12.4.15.4. In Example 4, (0.50 + 0.47 + 0.00)/3 = 0.32 
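The selection of one score per row and column, followed by averaging, might be sketched as below. The specification does not name a selection algorithm, so the greedy highest-first strategy used here is an assumption, as are the function names choose_scores and sn_val. The matrices are taken from Examples 3 and 4.

```python
def choose_scores(matrix: list[list[float]]) -> list[float]:
    """Greedy choice: take the highest remaining score, then exclude its
    row and column, until no free row/column pair is left."""
    cells = sorted(
        ((score, r, c) for r, row in enumerate(matrix) for c, score in enumerate(row)),
        key=lambda cell: -cell[0],  # stable sort: ties keep row-major order
    )
    used_rows, used_cols, chosen = set(), set(), []
    for score, r, c in cells:
        if r not in used_rows and c not in used_cols:
            used_rows.add(r)
            used_cols.add(c)
            chosen.append(score)
    return chosen


def sn_val(matrix: list[list[float]]) -> float:
    """Sum of the chosen scores divided by the number of scores chosen.

    Assumes a non-empty comparison matrix.
    """
    chosen = choose_scores(matrix)
    return sum(chosen) / len(chosen)


example_3 = [[0.90, 0.00, 0.28], [0.00, 0.36, 0.47], [0.16, 0.00, 0.00]]
example_4 = [[0.50, 0.00, 0.28], [0.00, 0.36, 0.47], [0.16, 0.00, 0.00]]

print(choose_scores(example_3), round(sn_val(example_3), 2))
print(choose_scores(example_4), round(sn_val(example_4), 2))
```

A greedy pass is simple but is not guaranteed to maximize the total; an optimal assignment method could be substituted if the business rules require it.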

2.12.4.16. The AFS will compare the SN_VAL to the SNTHR. 

2.12.4.16.1. The SN_VAL must be equal to or greater than the SNTHR. 

2.12.4.16.2. The record must pass the SNTHR to qualify for the final 
candidate list. 

2.12.4.17. Given Name Evaluation 

2.12.4.18. The GN has only one segment, the GN1. 

2.12.4.18.1. The AFS will perform a digraph comparison on the 
regularized GN1 stem of the database record and the regularized 
GN1 of the query record. 

2.12.4.18.2. The DI_VAL will be calculated as it was for the SN (see 
Section 2.12.4.8). 

2.12.4.18.3. The DI_VAL will be adjusted by several GN parameters. 

2.12.4.18.4. INITGN: Given Name Initial 

2.12.4.18.4.1. Definition: A GN1 is a single character and 
matches the first character of the GN1 of the other 
comparand. 

2.12.4.18.4.2. Action: Assign the INITGN value to the 
comparison value (i.e., do not calculate a DI_VAL). 

2.12.4.18.5. TAQ Evaluation will proceed as with the SN, mutatis 
mutandis (see Section 2.12.4.13.3). 

2.12.4.18.6. GNDR: Record Gender Value 

2.12.4.18.6.1. The AFS will compare the record gender of the 
input name and the query name. 

2.12.4.18.6.2. If the genders match, no action will take place. 

2.12.4.18.6.3. If the genders do not match, multiply the DI_VAL 
of the GN1 by the GNDR value. (See Figure 24.) 

2.12.4.19. The value resulting from all GN1 calculations will be the GN_VAL. 

2.12.4.20. The AFS will compare the GN_VAL to the GNTHR. (See Figure 
24.) 

2.12.4.20.1. The GN_VAL must be equal to or greater than the GNTHR. 

2.12.4.20.2. The record must pass the GNTHR to qualify for the final 
candidate list. 

2.12.4.21. Composite Score 

2.12.4.22. The AFS will develop a Composite Score for the two comparands. 

2.12.4.23. The AFS will adjust the GN_VAL and the SN_VAL by factors that 
reflect the proximity of the Refusal Level, Date of Birth and Country of 
Birth. 

2.12.4.24. The GN_VAL and SN_VAL will be multiplied by factors that apply 
to the RL, DOB and COB. 

2.12.4.25. Refusal Level Factor 

2.12.4.26. The AFS will access the Refusal Code Level Data Store to 
determine the Refusal Level Category of the Refusal Code. 

2.12.4.27. The AFS will access the Filter Parameter Data Store to find the 
PARM_VAL associated with the Refusal Level (RL#). 

2.12.4.28. Date-of-Birth Factor 

2.12.4.29. The AFS will access the Year-of-Birth Range Data Store to 
determine the YOB Category, YOB#, of the Dates-of-Birth of the 
comparands. The highest value is applied to the relationship. 

2.12.4.30. The AFS will access the Filter Parameter Data Store to find the 
PARM_VAL associated with the YOB Category (YOB#). 

2.12.4.31. Country-of-Birth Factor 

2.12.4.32. The AFS will access the Country of Birth Category Data Store to 
determine the COB Category, COB#. 

2.12.4.33. The AFS will access the Filter Parameter Data Store to find the 
PARM_VAL associated with the Country of Birth Category (COB#). 

2.12.4.34. The AFS will calculate a composite score by multiplying the 
SN_VAL by the GN_VAL by the RL# PARM_VAL by the YOB# 
PARM_VAL by the COB# PARM_VAL. 
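The threshold test and the composite calculation can be put together in a few lines of Python. This is a sketch only: the function name composite_score is hypothetical, and every numeric value in the usage lines (thresholds and PARM_VALs) is invented for illustration, since the actual values reside in the Filter Parameter Data Store.

```python
from math import prod
from typing import Optional


def composite_score(sn_val: float, gn_val: float,
                    rl_parm: float, yob_parm: float, cob_parm: float,
                    snthr: float, gnthr: float) -> Optional[float]:
    """Return the composite score, or None if either name score
    falls below its threshold (SNTHR / GNTHR)."""
    if sn_val < snthr or gn_val < gnthr:
        return None
    return prod((sn_val, gn_val, rl_parm, yob_parm, cob_parm))


# Illustrative values only; none of them come from the specification.
print(composite_score(0.87, 0.90, 1.0, 0.9, 0.8, snthr=0.6, gnthr=0.6))
print(composite_score(0.46, 0.90, 1.0, 1.0, 1.0, snthr=0.6, gnthr=0.6))
```

Because every factor is a multiplier, a mismatch in any one of Refusal Level, Year of Birth or Country of Birth lowers the whole score rather than a single component.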

2.12.4.35. Final Sort Function of the AFS 

2.12.4.36. The AFS will rank order the final candidate list of database records. 

2.12.4.37. The prime (unregularized) record will be returned to the user. 

2.12.4.37.1. There may be significant differences between the query 
record and the qualifying database records. 

2.12.4.37.2. The Composite Score will be returned with the record. 

2.12.4.38. Any record that is tagged as an exact match will be placed at the top 
of the list. 

2.12.4.39. All remaining records will be listed in descending order of 
Composite Score. 

2.12.4.40. The goal of the final sort is to place exact record matches on the top 
and to rank order the remaining records by the degree of contribution that 
each data element (SN, GN, DOB, COB, Refusal Code Level (RL)) 
makes to the overall record value. 

2.12.4.41. The details of the sort will be derived from extensive discussion 
about the business requirements. 

2.12.4.42. Because the scores from the various pipes may not have been 
calculated in the same way, a method for evaluating the relative value of 
candidate records will have to be devised. 

2.12.4.43. Internal Order 

2.12.4.43.1. There may be cases in which the sorting criteria are met 
equally by more than 1 record. 

2.12.4.43.2. Where multiple records qualify equally, there will be an 
internal sort order. 

2.12.4.43.2.1. SN Score 

2.12.4.43.2.2. GN Score 

2.12.4.43.2.3. DOB Levels 

2.12.4.43.2.4. Refusal Levels 

2.12.4.43.2.5. COB Relationships 

2.12.4.44. The AFS will return the top n records to the central CLASS-E 
sorter. 

2.12.4.44.1. The number of records to be returned will be a system 
setting. 
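A possible shape for the final sort is sketched below in Python. The Candidate record and the function final_sort are hypothetical. Only the first two internal tie-breakers (SN Score and GN Score) are modelled; the DOB, Refusal Level and COB tie-breakers are left out because this section does not define how they are ordered.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record returned by the filter stage."""
    uid: str
    composite: float
    sn_val: float
    gn_val: float
    exact_match: bool = False


def final_sort(candidates: list[Candidate], top_n: int) -> list[Candidate]:
    """Exact matches first, then descending Composite Score, with SN and
    GN scores as internal tie-breakers; return only the top n records."""
    ranked = sorted(
        candidates,
        key=lambda c: (not c.exact_match, -c.composite, -c.sn_val, -c.gn_val),
    )
    return ranked[:top_n]


records = [
    Candidate("a", 0.50, 0.80, 0.70),
    Candidate("b", 0.90, 0.90, 0.90),
    Candidate("c", 0.30, 0.50, 0.50, exact_match=True),
    Candidate("d", 0.50, 0.90, 0.60),
]
print([c.uid for c in final_sort(records, top_n=3)])
```

Here top_n stands in for the system setting that fixes how many records are returned to the central CLASS-E sorter.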

2.13. LINGUISTIC TRACE FACILITY MODULE DECOMPOSITION 

2.13.1. Identification 

This module is known as the Linguistic Trace Facility (LTF). 

2.13.2. Type 

The LTF is a program that will interact with any or all modules and functions 
within those modules. 

2.13.3. Purpose 

The LTF will allow system evaluators to access information about the system 
functions so that the quality of the content can be ensured. To diagnose and 
remedy problems associated with questionable system results, evaluators must 
have access to the results of system functionality at various points during the 
processing cycle. 

2.13.4. Function 

2.13.4.1. The LTF will be a mechanism that will copy and divert statistics, 
information and processing results to a file outside the main processing 
module. 

2.13.4.2. The file will be readily accessible on-line for examining by a system 
evaluator. 

2.13.4.3. Multiple trace points will be identified when the system is built. 

2.13.4.4. Examples of trace points: 

• What ARRs (by ID_NO) have applied 

• Regularized, repositioned name form 

• All keys generated for a query and for an add 

• SN and GN DI_VAL 

• SN_VAL and GN_VAL 

• Record Gender 

• Sort considerations 

3. DATA DECOMPOSITION 

3.1. DATA 

3.1.1. The input data for an ANA-E query will contain all information that is 
currently required by CLASS and in the standard format required by CLASS. 

• NAME (Surname, Given Name); 

• DOB (Date of Birth; Day Month Year); and 

• COB (Country of Birth; FIPS codes). 

In addition, the following will be specified: 

• Applicant Gender (AG): Male (M), Female (F), Unknown (U). 

• A unique identifier (UID) (as defined in CLASS-E). 

3.1.2. For adds, other record information will be entered, as required by CLASS 
and CLASS-E: e.g., refusal code, province of birth. 

3.2. DATA STORES 

The following data stores will be accessed by the ANA-E processing 
components: 

• Arabic Regularization Rules Data Store (ARR) 

• Arabic Title/Affix/Qualifier Data Store (ATD) 

• Arabic Name Type Data Store (ANT) 

• Filter Parameter Data Store (FP) 

• TAQ Filter Data Store (TF) 

• Refusal Code Level Data Store (RCL) 

• YOB Range Data Store (YR) 

• Refusal Code Level/Year-of-Birth Range Data Store (RLYOB) 

• COBPROX Data Store (COBPROX) 

• Arabic COB Category Data Store (ACOB) 

3.3. ARABIC REGULARIZATION RULES DATA STORE DECOMPOSITION 

3.3.1. Identification 

This rule base is known as the Arabic Regularization Rule Base (ARR). 

3.3.2. Type 

3.3.2.1. The ARR is a set of transformation rules accessed by the Arabic Rule 
Engine. 

3.3.2.2. The ARR will have the following format: 



Figure 14: Format: Arabic Regularization Rule Base 

FIELD NAME      DATA TYPE    FIELD SIZE    DATA VALUE 
ID_NO           integer      3             001...999 
PRE-CONTEXT     character    unlimited     any ASCII character 
IN              character    unlimited     any ASCII character 
POST-CONTEXT    character    unlimited     any ASCII character 
OUT             character    unlimited     any ASCII character 



3.3.2.3. Definitions 

3.3.2.3.1. ID_NO: a unique, arbitrary numerical reference to the rule. 

3.3.2.3.2. PRE-CONTEXT: preceding context for the element to be 
matched; delimited by preceding and following quotation marks 
(" ") 

3.3.2.3.3. IN: the match context; the portion of the name that will 
undergo change; delimited by preceding and following quotation 
marks (" ") 

3.3.2.3.4. POST-CONTEXT: trailing context for the element to be 
matched; delimited by preceding and following quotation marks 
(" ") 

3.3.2.3.5. OUT: the rule output; the realized change in the IN; delimited 
by preceding and following quotation marks (" ") 

3.3.2.4. There is no internal limit on the size of the Pre-Context, In, 
Post-Context or Out, although the system may have an external limit (e.g., the 
maximum size of the SN field). 

3.3.2.5. All rules will use standard regular expression notation, with one 
exception ($), which has been defined specifically for this rule base. 

3.3.2.6. Regular Expression Notation 



Figure 15: Regular Expression Notation 

REGEXP. 
NOTATION    DEFINITION 

.           Matches any single character, including white space. 

-           Stands for all characters that come between the two characters 
            given. This is a standard "from-to" notation; with characters, it 
            presumes an A to Z character set. For example, [A-D] will match 
            on A or B or C or D. (See [ ] below.) 

[ ]         Identifies a class of characters; a match can occur on a single 
            occurrence of any single element within the [ ]: [OU] will match 
            O or U. For example, "J[OU]N" will match on JON or JUN but not on 
            JOUN. If a + is added to the bracketed expression, "J[OU]+N", it 
            will match on any combination of any number of Os and Us: JOUUN, 
            JUUON, JOUOUN, JUUN, JOUUN, etc. In contrast, OU without [ ] will 
            match only on the exact combination of characters OU: "JOUN" will 
            match on JOUN only. 

+           Matches one or more occurrences of a preceding character or 
            regular expression, in any order. For example, JO+N will match 
            JON or JOON or JOOON, etc.; [OU]+ matches OOOU or OUOUOU or O or 
            UO or OOU or UUUO, etc. 

?           Matches zero or one occurrence of the preceding regular 
            expression. The expression "JOH?N" will match on JON or JOHN but 
            will not match on JOHHN. 

*           Matches zero or more occurrences of the preceding regular 
            expression. The expression "JOH*N" will match on JON, JOHN, 
            JOHHN, JOHHHHN, etc. 

( )         Groups together regular expressions. "(J[OU]N|H[AE]RRY)" will 
            match on JON or JUN or HARRY or HERRY. (Contrast with [ ] which 
            identifies a character class and contrast with { } which 
            identifies a metasymbol.) 

|           Matches either the preceding regular expression or the following 
            regular expression. The full expression is all contained within 
            ( ). For example, (AB|AP) will match the character string AB or 
            AP. The same expression may also be written A[BP]. 

" "         Defines the context boundary: the Preceding Context, Match 
            Context, Post Context and Output. A context that is made up of 
            only one metasymbol and is not bracketed by { } should not be 
            surrounded by " ". For example, the metasymbol Consonant can 
            stand alone. If the metasymbol is enclosed within { }, then all 
            regular expressions contained within the context must be enclosed 
            within " ". For example, "{Consonant}{Letter}" within one context 
            requires both { } and " ". 

{ }         Contains one or more pre-defined metasymbols. If { } are used, 
            they must be surrounded by " ". If a single metasymbol occurs 
            alone, no { } are necessary and therefore no " " are necessary. 
            For example, the single metasymbol Vowel can appear either as 
            Vowel or "{Vowel}". If more than one element is used, the 
            metasymbols must all appear within { } and the whole string 
            within " ". E.g., Vowel, "{Vowel}", "J{Vowel}HN" are acceptable 
            formats. 

$           Indicates the character to output. $ is defined differently from 
            other standard definitions. $ is a variable that is followed by 
            an integer that references a character in the match string. For 
            example, each character in an input string is associated with a 
            different, consecutive integer value, up to the number of 
            characters in the match: JONES becomes J = $1, O = $2, N = $3, 
            E = $4, S = $5. Reference can be made to the index values in the 
            output string. $1 $2 $2 $3 $3 $5 would represent JOONNS. 
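The positional meaning of $ can be illustrated with a small Python sketch. The function name expand_output is hypothetical, and the blanks between the $ references are assumed to be separators that do not appear in the output.

```python
import re


def expand_output(match_string: str, template: str) -> str:
    """Replace each $n in the template with the n-th character
    (1-based) of the matched string."""
    compact = template.replace(" ", "")  # assumption: blanks are separators only
    return re.sub(
        r"\$(\d+)",
        lambda m: match_string[int(m.group(1)) - 1],
        compact,
    )


print(expand_output("JONES", "$1 $2 $2 $3 $3 $5"))
```

Literal characters in the template, if any, pass through unchanged, so an output that mixes fixed letters with $ references can be handled the same way.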
3.3.2.7. Metasymbols 

A number of metasymbols will be accessed by the rules. The 
metasymbols are variables declared at the beginning of the ARR Data 
Store. 



Figure 16: ARR Metasymbols 

METASYMBOL          DEFINITION 
Letter              "[A-Z]" 
Consonant           "[BCDFGHJKLMNPQRSTVWXYZ]" (N.B. Includes W and Y) 
Nog (= No Glide)    "[BCDFGHJKLMNPQRSTXZ]" (N.B. No W or Y) 
Alla                "[AEOUI]+L+[EA]+H?" 
Rhyme               "[AEIOUY]+[BCDFGHJKLMNPQRSTVWXYZ]" 
Kesra               "([EA]?[IY]+|[IE]+|Y)" 
Dad                 "(Z+|TH+|DH+|DD+)" 
Tha                 "(Z+|TH+|DH+|C+|S+|T+)" 
Jim                 "(DJ|J|Y|G|DZH|DSCH|GG|DY)" 
Gine                "(G|GH|RH)" 
Qaf                 "(Q|G|K|J|KH|GH|C|QU|CK)" 
Kha                 "(KH|K|X|Q|C)" 
Marbuta             "([EAI]H|[AE]T?)" 
Sun                 "(C|S|N|D|T|R|Z|G|J)" 
Sungem              "(SS|NN|DD|TT|RR|ZZ|GG|JJ)" 
Moongem             "(BB|FF|GG|HH|JJ|KK|MM|NN|PP|QQ|VV|WW|XX)" 
Vowelgem            "(AA|EE|II|OO|UU)" 
Vowel               "[AEIOU]" 
Didi                "(KHKH|SHSH|GHGH|RHRH|DHDH|THTH|CHCH|PHPH)" 
Dig                 "[KSGRDTCP]H" 
Bound               " " (= white space) 
Anything 
Othergem            "(CC|LL|YY)" 



3.3.2.8. Purpose 

The ARR allows records with highly divergent spellings and/or 
representations of the same name to be retrieved from the database. Usual 
character comparison techniques are unable to retrieve records with these 
variants. 

3.3.2.9. Function 

The ARR applies relevant rules to each Arabic name field and produces a 
common representation for variant realizations of the same name. 
MUHAMMED, MOHAMMAD and IMHEMED are variant forms of 
the same name; each will be set equal to one single representation of the 
name: MUHAMAD, for example. The successful application of one or 
more rules will produce as output a regularized Arabic name string. 

3.3.2.10. Examples 

3.3.2.10.1. Example 1 contains two rules that apply to variants of
ABDULLA:



ANA-E    Language Analysis Systems, Inc.    40    03/19/98



• EVDILLAH

• ABD ALA

• ABDU ALLA

• ABDULLAH

• OABDELA

• AABDILA

• ABDELILA

ID NO   PRE-CONTEXT   IN (MATCH CONTEXT)                            POST-CONTEXT   OUT
-----   -----------   ------------------                            ------------   ---
676                   "[HKCQ]?[AE]+[BV]*D+{Vowel}?{Bound}{Alla}"    Bound          "abdula"
677     Bound         "[HKCQ]?[AE]+[BV]*D+[AEIOU]*L+[IE]*L*AH?"     Bound          "abdula"

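As a rough illustration of how rules 676 and 677 regularize the variants, the following Python sketch expands the metasymbols and tries each rule in turn. The helper names (`expand`, `regularize`) are hypothetical, the patterns are as read from the table above, and the Bound pre- and post-contexts are approximated by requiring the rule to cover the whole name field, which is an assumption rather than the documented matching procedure.

```python
import re

# Only the metasymbols that the two rules use.
METASYMBOLS = {"Vowel": "[AEIOU]", "Bound": " ", "Alla": "[AEOUI]+L+[EA]+H?"}


def expand(pattern: str) -> str:
    """Substitute each {Name} with its definition in a non-capturing group."""
    return re.sub(
        r"\{(\w+)\}",
        lambda m: "(?:" + METASYMBOLS[m.group(1)] + ")",
        pattern,
    )


# (ID NO, IN pattern, OUT) for the two rules.
RULES = [
    (676, "[HKCQ]?[AE]+[BV]*D+{Vowel}?{Bound}{Alla}", "abdula"),
    (677, "[HKCQ]?[AE]+[BV]*D+[AEIOU]*L+[IE]*L*AH?", "abdula"),
]
COMPILED = [(rid, re.compile(expand(p)), out) for rid, p, out in RULES]


def regularize(name: str) -> str:
    """Return the OUT string of the first rule matching the whole field."""
    for _rid, regex, out in COMPILED:
        if regex.fullmatch(name):
            return out
    return name  # no rule applied; the field is left unchanged


for variant in ["EVDILLAH", "ABD ALA", "ABDU ALLA", "ABDULLAH", "AABDILA", "ABDELILA"]:
    print(variant, "->", regularize(variant))
```

Rule 676 covers the two-segment spellings (ABD ALA, ABDU ALLA) because it requires a Bound inside the match; rule 677 covers the single-segment spellings.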


3.3.2.10.2. Example 2 contains one rule that applies to variants of G:

• MAGUID

• MADZHID

• MADSCHID

• MADJID

• MAJID

• GHASSAN

Figure 18: Example 2: "G" Regularization Rules

ID NO   PRE-CONTEXT   IN (MATCH CONTEXT)        POST-CONTEXT   OUT
-----   -----------   ------------------        ------------   ---
132     Anything      "(DJ|GH|DSCH|DZH|J+)"     Anything       "g"



3.4. ARABIC TITLE/AFFIX/QUALIFIER DATA STORE DECOMPOSITION
Because the ANA-E design is viewed as an independent sub-program of the CLASS-E
system, the Arabic Title/Affix/Qualifier Data Store is presented here as a separate
table. It is strongly suggested, however, that CLASS-E support one TAQ Data Store
in which the cultural affinity of each TAQ segment is indicated. This will reduce table
maintenance and will provide a global picture of the handling of TAQs.

3.4.1. Identification 

This data store is known as the Arabic Title/Affix/Qualifier Data Store (ATD).

3.4.2. Type 

The ATD is a data store that contains the Arabic-specific Title, Affix and 
Qualifier segments and their distribution. It will be accessed by the Arabic 
Preprocessor (APP) and the Arabic Filter and Sorter. 
Figure 19: Format: Arabic TAQ Data Store

DATA FIELD   DATA TYPE   FIELD SIZE   DATA VALUE
----------   ---------   ----------   ----------
ID_NO        integer     4            1...9999
TAQ FORM     character   15           alphabetics
TAQ TYPE     character   1            T, P, I, S, Q
DELETE       integer     1            1, 0 (True, False)
DISREGARD    integer     1            1, 0 (True, False)



3.4.2.1. Definitions 

3.4.2.1.1. ID_NO: a unique, arbitrary number that identifies the TAQ
segment.

3.4.2.1.2. TAQ FORM: the string that represents the TAQ; the TAQ 
FORM may be a multipart string (i.e., a string that includes 
internal white space). 

3.4.2.1.3. TAQ TYPE: an indicator of the kind of TAQ segment present: 
a title (T), prefix (P), infix (I), suffix (S) or qualifier (Q). 

3.4.2.1.4. DELETE: 

3.4.2.1.4.1. The segment is to be removed from all further
consideration in the name search process; it will
contribute marginally to the filtering process. It will be
returned with the record to the user.

3.4.2.1.4.2. The segment is referenced in the filtering
process.

3.4.2.1.4.3. The segment is not removed from the original
record and is returned with the record to the user.

3.4.2.1.4.4. True (1) or False (0) indicates whether or not this 
function is to apply to the segment(s) under 
consideration. 

3.4.2.1.5. DISREGARD:

3.4.2.1.5.1. The segment is to be removed from further
consideration in the name search process but will
undergo special evaluation in the filtering process. It
will be returned with the record to the user.

3.4.2.1.5.2. True (1) or False (0) indicates whether or not this
function is to apply to the segment(s) under 
consideration. 

3.4.3. Purpose 

Peripheral elements (Titles, Affixes, and Qualifiers) in names do not contribute
as much to the name evaluation as does the name stem. Identifying and
removing these elements in the name processing component is important.
They do, however, contribute to the overall value of a name when compared to
another name. They will therefore contribute some value to the filtering and
sorting processes.

3.4.4. Function 

The ATD serves as a repository for all TAQ values and for the treatment that 
each will be subjected to. 

3.5. ARABIC NAME TYPE DATA STORE DECOMPOSITION

3.5.1. Identification

This data store is known as the Arabic Name Type Data Store (ANT). 

3.5.2. Type 

3.5.2.1. The ANT is a data store of unique regularized Arabic name segments.

3.5.2.2. The ANT is generated only after regularization has applied to the input 
name. 

3.5.2.3. The ANT will have the following format: 



Figure 20: Format: Arabic Name Type Data Store

DATA FIELD         DATA TYPE   FIELD SIZE   VALUE RANGE
----------         ---------   ----------   -----------
ID_NO              integer     5            00001...99999
ARABIC_NAME_TYPE   character   24           alphabetics
GENDER             character   1            M, F, U
HI_FREQ            integer     1            1, 0 (True or False)
3.5.2.4. Definitions

3.5.2.5. ID_NO: a unique, arbitrary numerical reference to the name segment
(ARABIC_NAME_TYPE).

3.5.2.6. ARABIC_NAME_TYPE: unique entries that correspond to the
regularized form of a name segment.

3.5.2.7. GENDER: the gender associated with a particular name segment: M
(Male), F (Female), U (Unknown/Unspecified). As records are added to
the ANT, gender will be specified as U. The gender assigned to new
table entries will be periodically reevaluated so that names that can be
identified for gender can be appropriately marked.

3.5.2.8. HI_FREQ: the frequency of all names will be indicated. True (1) will
indicate that a name segment is considered a high frequency Arabic name
segment. All other segments will be marked as False (0), a low-frequency
name segment.

3.5.3. Purpose 

The purpose of the ANT data store is to reduce the need to perform repeated 
digraph comparisons on a large store of names and to permit the retrieval of 
gender-matching records. 

3.5.4. Function 

The ANT will provide information about the distinct Arabic name types, their 
frequency and gender. 

3.6. FILTER PARAMETER DATA STORE DECOMPOSITION

3.6.1. Identification

This module is known as the Filter Parameter Data Store (FP).

3.6.2. Type

3.6.2.1. The FP is a data store that will be accessed by the Filter Component of 
the Arabic Filter and Sorter (AFS). 

3.6.2.2. The FP is a parameter table that will be accessible to and adjustable by 
the user and whose cell values will be determined through testing and 
comparative evaluation. 

3.6.2.3. The FP has the following format: 



Figure 21: Format: Filter Parameter Data Store

DATA FIELD   DATA TYPE   FIELD SIZE   VALUE RANGE   DATA VALUE
----------   ---------   ----------   -----------   ----------
PARM_NAME    character   6            alphabetics   SNTHR, GNTHR, OPSN, INITSN,
                                                    GNDR, INITGN, TAQASN, TAQAGN,
                                                    TAQXSN, TAQXGN
PARM_VAL     decimal


Figure 22: Example: Filter Parameter

PARM_NAME   PARM_VAL
---------   --------
SNTHR       0.60
GNTHR       0.65
OPSN        0.60
INITSN      0.85
INITGN      0.85
            0.65
TAQASN      0.90
TAQAGN      0.90
TAQXSN      0.85
TAQXGN      0.85
RL0         1.20
RL1         1.15
RL2         1.10
RL3         1.05
RL4         1.00
YOB0        1.30
YOB1        1.25
YOB2        1.20
YOB3        1.15
YOB4        1.10
YOB5        1.05
YOB6        1.00
COB1        1.20
COB2        1.15
COB3        1.10
COB4        1.00
COB5        0.95



3.6.2.4. The values provided are as examples only and do not
represent the PARM_VALs to be used for the parameters.

3.6.3. Purpose

... adjustable parameters that ... comparands.

3.6.4. Function

... functions as an independent data store ... needed by the AFS during the
filtering process.

3.7. TAQ FILTER DATA STORE DECOMPOSITION 

3.7.1. Identification

This data store is known as the TAQ Filter Data Store (TF).

3.7.2. Type

3.7.2.1. The TF will be accessed by the Arabic Filter and Sorter and provides
parameter factors for matching TAQ DISREGARD tags during record
filtering.

3.7.2.2. The format of the TF follows:



Figure 23: Format: TF Matrix Design

DATA FIELD   DATA TYPE   FIELD SIZE   VALUE RANGE   DATA VALUE
----------   ---------   ----------   -----------   ----------
TAQDIS#1     character   8            alphabetics   TAQ_DISREGARD ITEM
TAQDIS#2     character   8            alphabetics   TAQ_DISREGARD ITEM
TF_VALUE     decimal     4            0.00...1.00   Various (TBD)



3.7.2.3. Definitions 

3.7.2.4. TAQDIS#1: is the TAQ DISREGARD segment that occurs in one or
the other (different) of the comparands. 

3.7.2.5. TAQDIS#2: is the TAQ DISREGARD segment that occurs in one or 
the other (different) of the comparands. 

3.7.2.6. TF_VALUE: is the factor that will be used to adjust the SN_VAL or 
GN_VAL if the TAQDIS#1 and TAQDIS#2 are present in the 
comparands. 

Figure 24: Example: TF Sample (Values are for example only)

TAQDIS#1   TAQDIS#2   TF_VALUE
--------   --------   --------
ABD EL     ABD EL     1.00
ABD EL     ABU        0.75
ABD EL     AL         0.85
ABD EL     BIN        0.75
ABD EL     EL DIN     0.50
ABU        ABU        1.00
ABU        AL         0.85
ABU        BIN        0.50
ABU        EL DIN     0.85
AL         AL         1.00
AL         BIN        0.85
AL         EL DIN     0.50
BIN        BIN        1.00
BIN        EL DIN     0.85
EL DIN     EL DIN     1.00



3.7.3. Purpose 

Arabic names often have peripheral name elements. Some of these make up a
segment of the name, the TAQ values identified in the TF. Their relative
value, however, varies. Some of them cannot co-occur, some have opposite
meanings, so it is necessary to identify their relative value when they are
contrasted with one another.

3.7.4. Function 

The TF provides the resources for the AFS to determine the relative value of 
two TAQs that occur in two comparands. 
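
A minimal Python sketch of a TF lookup follows. It assumes, since the text does not say so explicitly, that the TF is symmetric (the order of the two comparands does not matter) and that the factor "adjusts" SN_VAL or GN_VAL by multiplication. The names `TF_SAMPLE`, `tf_value` and `adjust_score` are hypothetical, and the factors are a subset of the example values given for the TF.

```python
# A few sample factors (example values only); keys are alphabetically
# ordered pairs of TAQ DISREGARD segments.
TF_SAMPLE = {
    ("ABD EL", "ABD EL"): 1.00,
    ("ABD EL", "ABU"): 0.75,
    ("ABD EL", "AL"): 0.85,
    ("ABD EL", "BIN"): 0.75,
    ("ABD EL", "EL DIN"): 0.50,
    ("ABU", "ABU"): 1.00,
    ("ABU", "BIN"): 0.50,
}


def tf_value(taq1: str, taq2: str) -> float:
    """Look up the factor for two TAQ segments, in either order."""
    return TF_SAMPLE[tuple(sorted((taq1, taq2)))]


def adjust_score(score: float, taq1: str, taq2: str) -> float:
    """Assumption: the factor is applied to SN_VAL or GN_VAL by multiplication."""
    return score * tf_value(taq1, taq2)


print(tf_value("ABU", "ABD EL"))               # 0.75
print(adjust_score(0.80, "EL DIN", "ABD EL"))  # 0.4
```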

3.8. REFUSAL-CODE LEVEL DATA STORE DECOMPOSITION

3.8.1. Identification

This data store is known as the Refusal Code Level Data Store (RCL).

3.8.2. Type

3.8.2.1. It is recommended that the RCL be a parameter file, which can be 
accessed by the client so RC categories can be added to or changed with 
ease. 

3.8.2.2. The RC data store will provide a list of the Refusal Codes and the 
level of seriousness of each Refusal Code. 

3.8.2.3. The RCL has the following format: 



Figure 25: Format: Piece of Refusal Code Level Data Store (RCL)

DATA FIELD   DATA TYPE       FIELD SIZE   VALUE   CATEGORY DEFINITION
----------   ---------       ----------   -----   -------------------
00           alphanumerics   3            RL0     Most serious RC: 00
23           alphanumerics   3            RL1     Type 1 Serious RCs
6C           alphanumerics   3            RL2     Type 2 Serious RCs
07           alphanumerics   3            RL3     Type 1 Non-serious RCs
G            alphanumerics   3            RL4     Type 2 Non-serious RCs

3.8.2.4. Definitions 

3.8.2.4.1. DATA FIELD: indicates each Visa Refusal Code (Codes and 
their Refusal Level (see VALUE) are for example only; they do 
not represent the complete list nor the accurate assignment of a 
Refusal Code to a Refusal Level). 

3.8.2.4.2. DATA TYPE: The RL# will appear in the form RL1, RL2,
etc.

3.8.2.4.3. VALUE: RL# is the Refusal Level category to which a
particular Refusal Code has been assigned. The Visa Office will
assign Refusal Codes to one of 4 categories: RL1, RL2, RL3,
RL4; RL0 is reserved for the Refusal Code 00. (The current
distinction among Refusal Codes is a binary one: serious and
non-serious. Assignment of Refusal Codes to more groups has
not yet been done; the consequence is that one or more of these
categories may not have a distinct value.) The RL# occurs in
ascending order, from most serious to least serious Refusal Code.
The RL# will be linked to a Year-of-Birth Code (see Section 3.9)
to determine the relevant subsets of records to be searched.

3.8.2.4.4. CATEGORY DEFINITION:

• RC0 refers to the Refusal Code 00.

• RC1 refers to all Refusal Codes that have been designated as
Type 1 Serious RC, i.e., the most serious, excluding 00.

• RC2 refers to all Refusal Codes that have been designated as
Type 2 Serious RC, i.e., serious but less serious than RC0
and RC1.

• RC3 refers to all Refusal Codes that have been designated as
Type 1 Non-Serious RC. These codes are less serious than
the RC0, RC1 and RC2 codes.

• RC4 refers to Refusal Codes that have been designated as
Type 2 Non-Serious. These codes are the least serious codes,
less serious than the RC0, RC1, RC2 and RC3 codes.

3.8.3. Purpose 

It has long been desirable to make more granular distinctions among the 
Refusal Codes. For many years, DOS has maintained a distinction between 
serious and non-serious codes; these different categories were correlated with 
different YOB search ranges. However, a mechanism for making greater 
distinctions will provide greater flexibility in delimiting the set to be retrieved 
during the first stage of record analysis. The introduction of five refusal code 
levels also provides the opportunity to correlate more year-of-birth ranges to 
the refusal code levels. 

3.8.4. Function 

The RCL provides information needed for the evaluation of record proximity 
in the Arabic filtering process and contributes to the delimitation of database 
records, retrieved through the RL/YOB Data Store. 

3.9. YEAR-OF-BIRTH RANGE DATA STORE DECOMPOSITION 

3.9.1. Identification 

This data store is known as the Year-of-Birth Range Data Store (YR).

3.9.2. Type 

3.9.2.1. It is recommended that the YR be a parameter file, which can be 
accessed by the client so YOB ranges can be set. Alternatively, it could 
be represented as a system parameter whose value(s) are set in an .ini file. 

3.9.2.2. The YR will define the YOB ranges that will be associated with a 
Refusal Level (see Section 3.8). 

3.9.2.3. This data store has the following format: 
Figure 26: Format: Year-of-Birth Range Data Store (YR)

3.9.2.3.1. Definitions

3.9.2.3.2. DATA FIELD: YOB# is the Year-of-Birth Range category
whose value indicates the year-of-birth range to be searched. The
year-of-birth VALUE indicates the search range, that is, the
number of years on either side of a given year-of-birth to be
searched. For example, if the input year is 1962 and YOB#
is 4, the search will cover a range of nine years, 1958-1966. The
range includes the full year, so all of 1958 and all of 1966.

3.9.2.3.2.1. There are seven YOB# categories: YOB0, YOB1,
YOB2, YOB3, YOB4, YOB5, YOB6.

• YOB0 is a single integer that refers to an exact
month, day, year of birth. If YOB0 is specified, the
system must be able to match the month, day and
year of the Date of Birth of an input record and a
database record.

• YOB1 is a single character (A) that refers to an exact
year-of-birth with the month and day inverted.

1. If YOB1 is specified, the system must be able to
match the year of Date of Birth and an inverted
month and day (DEC 03 = MAR 12) of the
input record and the database record.

2. YOB1 will be relevant to the Arabic Filter and
Sorter, but may not function as a search
parameter since the value would be subsumed in
YOB2.

• YOB2 is a single character (B) that refers to an exact
year-of-birth. If YOB2 is specified, the system must
be able to match the year of the Date of Birth of an
input record and a database record.

• YOB3 is a one- or two-place integer (1...99) that
refers to a narrow year-of-birth range. Narrow
year-of-birth range is usually defined as 1 year (for a
search range of 3 years).

• YOB4 is a one- or two-place integer (1...99) that
refers to a standard year-of-birth range. Standard
year-of-birth range is usually defined as 3 years (for
a search range of 7 years).

• YOB5 is a one- or two-place integer (1...99) that
refers to a wide year-of-birth range. Wide
year-of-birth range is usually defined as 5 years (for a search
range of 11 years).

• YOB6 is a one- or two-place integer (1...99) that
refers to an unlimited or extremely wide
year-of-birth range. Unlimited year-of-birth range would be
set sufficiently high to include all (or all desired)
years-of-birth in the database (e.g., 50).
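
The arithmetic behind the search ranges can be checked with a small Python sketch. The function names are hypothetical; the only rule used is the one stated above, namely that the value counts years on either side of the input year and that both end years are included in full.

```python
def yob_search_range(year_of_birth: int, yob_value: int) -> tuple[int, int]:
    """Return the first and last year searched for a given range value."""
    return year_of_birth - yob_value, year_of_birth + yob_value


def years_covered(yob_value: int) -> int:
    """Number of calendar years in the range, both end years included."""
    return 2 * yob_value + 1


low, high = yob_search_range(1962, 4)
print(low, high, years_covered(4))  # 1958 1966 9
```

The same formula reproduces the narrow, standard and wide ranges: values of 1, 3 and 5 cover 3, 7 and 11 years respectively.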

3.9.3. Purpose 

This YR provides a greater granularity in the year-of-birth range and, 
therefore, greater flexibility in delimiting the set to be retrieved during the 
first stage of record analysis. The correlation of five refusal code levels 
to different year-of-birth ranges will help to delimit the number of records 
to be searched and to define the more valuable set of records. 

3.9.4. Function 

3.9.4.1. The YR permits greater granularity in the Date-of-Birth types related 
to the system. 

3.9.4.2. The YR will be accessed by the Refusal Code Level/YOB Range Data
Store, which will limit the retrieval range in the Arabic Search Engine.

3.9.4.3. The YR will contribute information to the composite score in the
Arabic Filter and Sorter.



3.10. REFUSAL CODE LEVEL / YOB RANGE DATA STORE MODULE 
DECOMPOSITION 

3.10.1. Identification 

This data store is known as the Refusal Code Level/YOB Range Data Store 
(RLYOB). 

3.10.2. Type 

3.10.2.1. The RLYOB is a matrix that merges the values in the Refusal Code 
Level (RCL) Data Store and the Year-of-Birth Range (YR) Data Store. 

3.10.2.2. For each Refusal Level (RL), a Year-of-Birth (YOB) Range is 
specified. 

3.10.2.2.1. Only one YOB Range for each RL is permitted.

3.10.2.2.2. The same YOB Range may apply to more than one RL.

3.10.2.3. The RLYOB has the following format:



Figure 27: Format: Refusal Level/Year-of-Birth Range Data Store (RLYOB)

DATA FIELD   DATA TYPE   FIELD SIZE   VALUE RANGE   DATA VALUE
----------   ---------   ----------   -----------   ----------
RL#          character   3            RL0...4       RL0, RL1, RL2, RL3, RL4
YOB#         character   4            YOB0...6      YOB0, YOB1, YOB2, YOB3, YOB4, YOB5, YOB6

Figure 28: Example: RLYOB Data Store

RL#    YOB#
---    ----
RL0    YOB5
RL1    YOB4
RL2    YOB3
RL3    YOB3
RL4    YOB2



3.10.2.4. Definitions: 

3.10.2.5. RL#: is a character string that indicates the Refusal Level of the 
Refusal Code. 

3.10.2.6. YOB#: is a character string that indicates the Date-of-Birth Range 
Category of the comparands. 

3.10.3. Purpose 

Retrieval of records from the database should be delimited by a relationship 
between the Refusal Code Level and the Year-of-Birth Range. It will restrict 
the number of records to be reviewed. 

3.10.4. Function 

The RLYOB is a resource for the Arabic Search Engine to delimit the records 
retrieved from the database. 
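
How the RLYOB might delimit retrieval can be sketched as follows. The pairing of Refusal Levels with YOB categories is taken from the example RLYOB data store; the widths assigned to each YOB category are the "usual" definitions given for the YR (1, 3 and 5 years, with YOB2 treated as an exact year), and are assumptions here, as are the names `RLYOB`, `YOB_WIDTH` and `retrieval_years`.

```python
# Refusal Level -> YOB category, following the example RLYOB data store.
RLYOB = {"RL0": "YOB5", "RL1": "YOB4", "RL2": "YOB3", "RL3": "YOB3", "RL4": "YOB2"}

# Assumed widths (years on either side of the input year) per YOB category.
YOB_WIDTH = {"YOB2": 0, "YOB3": 1, "YOB4": 3, "YOB5": 5}


def retrieval_years(refusal_level: str, year_of_birth: int) -> range:
    """Years of birth to retrieve for a record with the given Refusal Level."""
    width = YOB_WIDTH[RLYOB[refusal_level]]
    return range(year_of_birth - width, year_of_birth + width + 1)


for level in ["RL0", "RL2", "RL4"]:
    years = retrieval_years(level, 1962)
    print(level, years[0], years[-1])
```

In this sketch the most serious level (RL0) is searched over the widest range and the least serious (RL4) over the exact year only, and RL2 and RL3 share one range, which the format explicitly permits.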

3.11. COUNTRY-OF-BIRTH PROXIMITY DATA STORE DECOMPOSITION 

3.11.1. Identification 

This data store is known as the Country-of-Birth Proximity Data Store 
(COBPROX). 

3.11.2. Type

3.11.2.1. The COBPROX is a matrix whose cells contain a decimal that
reflects the degree of relationship between the country represented on the
x-axis and the country represented on the y-axis.

3.11.2.2. The COBPROX has the following format:
Figure 29: Format: COBPROX Data Store

DATA FIELD   DATA TYPE   FIELD SIZE   VALUE RANGE   DATA VALUE
----------   ---------   ----------   -----------   ----------
COB#1        character   4            alphabetics   COB Code
COB#2        character   4            alphabetics   COB Code
COBVAL       decimal     4            0.00...1.00   Various


Figure 30: Example: Piece of COBPROX Data Store

COB#1   COB#2   COBVAL
-----   -----   ------
AGS     AGS     1.00
AGS     ALG     0.05
AGS     MORO    0.05
AGS     SARB    0.05
AGS     SYR     0.05
ALG     ALG     1.00
ALG     MORO    0.85
ALG     SARB    0.75
ALG     SYR     0.75
MORO    MORO    1.00
MORO    SARB    0.75
MORO    SYR     0.75
SARB    SARB    1.00
SARB    SYR     0.75
SYR     SYR     1.00

3.11.2.3. Definitions: 

3.11.2.3.1. COB#1: is the 4-character COB Code of one of the
comparands.

3.11.2.3.2. COB#2: is the 4-character COB Code of one of the
comparands.

3.11.2.3.3. COBVAL: is the decimal value assigned through the ACOB
(and other COB Category Data Stores).

3.11.3. Purpose 

The COBPROX Data Store provides information on the relative value of the
COBs in two comparands. This value can serve to limit the COBs that are
accessed for retrieval.

3.11.4. Function 

The COBPROX is populated by the ACOB and any other partition-specific 
Country-of-Birth Category Data Stores. The COBPROX provides COB 
relationship information. 
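
One way the COBVAL could limit the COBs accessed for retrieval is sketched below in Python. The values are the example COBPROX entries; treating the matrix as symmetric and using a simple cut-off threshold are assumptions made for illustration, and the names `COBPROX`, `cobval` and `permitted_cobs` are hypothetical.

```python
# Example COBPROX entries; keys are alphabetically ordered COB Code pairs.
COBPROX = {
    ("AGS", "AGS"): 1.00, ("AGS", "ALG"): 0.05, ("AGS", "MORO"): 0.05,
    ("AGS", "SARB"): 0.05, ("AGS", "SYR"): 0.05,
    ("ALG", "ALG"): 1.00, ("ALG", "MORO"): 0.85, ("ALG", "SARB"): 0.75,
    ("ALG", "SYR"): 0.75,
    ("MORO", "MORO"): 1.00, ("MORO", "SARB"): 0.75, ("MORO", "SYR"): 0.75,
    ("SARB", "SARB"): 1.00, ("SARB", "SYR"): 0.75,
    ("SYR", "SYR"): 1.00,
}
ALL_COBS = ["AGS", "ALG", "MORO", "SARB", "SYR"]


def cobval(cob1: str, cob2: str) -> float:
    """Proximity of two COB Codes, independent of argument order."""
    return COBPROX[tuple(sorted((cob1, cob2)))]


def permitted_cobs(input_cob: str, threshold: float) -> list[str]:
    """COBs whose proximity to the input COB meets the assumed cut-off."""
    return [c for c in ALL_COBS if cobval(input_cob, c) >= threshold]


print(permitted_cobs("ALG", 0.80))  # ['ALG', 'MORO']
print(permitted_cobs("ALG", 0.75))  # ['ALG', 'MORO', 'SARB', 'SYR']
```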
3.12. ARABIC COUNTRY-OF-BIRTH CATEGORY DATA STORE
DECOMPOSITION

3.12.1. Identification 

This data store is known as the Arabic Country-of-Birth Category Data Store 
(ACOB). 

3.12.2. Type 

The ACOB is a data store that will serve as the source of information for
the COBPROX Data Store, supplying the COBVAL, and will provide the COB
Category (COBCAT) necessary for the Arabic Filter and Sorter.



Figure 31: Format: Arabic Country-of-Birth Category Data Store (ACOB)

DATA FIELD   DATA TYPE    FIELD SIZE   VALUE RANGE     DATA VALUE
----------   ---------    ----------   -----------     ----------
COB#1        characters   4            alphabetics     COB Code
COB#2        characters   4            alphabetics     COB Code
COBCAT       characters   5            alphanumerics   COB1...COB99
COBVAL       decimal      4            0.00...1.00     Various



3.12.3. Definitions 

3.12.3.1. COB#1: is the 4-character COB Code of one of the comparands.

3.12.3.2. COB#2: is the 4-character COB Code of one of the comparands.

3.12.3.3. COBCAT: is the category assigned to the relationship of two COBs.

3.12.3.3.1. Categories might include such relationships as Exact, State,
Geographic Region, Dialect Region.

3.12.3.3.2. All relationships are adjustable.

3.12.4. COBVAL: is the decimal value that will be assigned to a particular COB
relationship; this value will be used to determine the COBs that will be
permitted in the retrieval process.

3.12.5. Example COB Categories might be:

COB1: Exact represents an exact match of the COBs:
ALG/ALG; the COBPROXVAL would be 1.00.

COB2: Western Dialect Region represents the set of COBs
that are in close geographic proximity and share naming
conventions: ALG/MORO. The score would be something less
than that applied to an exact match but nonetheless high: 0.85.

COB3: Arabic Partition represents all COBs within the
Arabic partition. The value assigned would be less than that for
COB2: 0.75.

COB4: All refers to all COBs and is assigned a value that
will allow the search of all COBs; it would be the lowest decimal
value used.
Figure 32: Example: Piece of ACOB (Values for example only.)

COB#1   COB#2   COBCAT   COBVAL
-----   -----   ------   ------
ALG     ALG     COB1     1.00
ALG     MORO    COB2     0.85
ALG     SARB    COB3     0.75
ALG     SYR     COB3     0.75
MORO    MORO    COB1     1.00
MORO    SARB    COB3     0.75
MORO    SYR     COB3     0.75
SARB    SARB    COB1     1.00
SARB    SYR     COB3     0.75
SYR     SYR     COB1     1.00

3.12.6. Purpose

Pre-defined COB category relationships will provide a definition of the values
that appear in the COBPROX Data Store.

3.12.7. Function 

These COB categories will provide information about COB relationships that 
will contribute to determination of the Composite Score in the Arabic Filter 
and Sorter. 
ALGORITHM

1. INTRODUCTION

1.1. Purpose

The current VLDB consists of about 5 million refusal records. The outlook 
envisions significant growth of the database in the near future and continued growth 
as more and more data are shared with other Government agencies. Currently, 
between 45% and 50% of the records have a country of birth from the Hispanic 
world, about 2.5 million records. These proportions are unlikely to change as the 
database expands. 

Additionally, the character of Hispanic personal names is such that they are both
dense and complex. Dense means that relatively few individual surnames
account for the vast majority of surname occurrences. That is, the 500 most
frequently occurring distinct surnames account for over 70% of all surname
occurrences in the database. The surnames of well over 50% of the records contain
only high frequency surnames. Another 25-30% contain at least one of the high
frequency surnames. Complex means that Hispanic surnames generally contain more
than 1 surname, the first of which is the family name, the second a matronymic (FLORES
GOMEZ). Approximately 75% of the surnames from the Hispanic partition contain
2 surname stems (not including affixes like DE, DE LOS). Another 23% have only 1
surname stem. (The remaining records have 3-6 stems.)

The frequency of the names, the high portion of the VLDB and the syntactic
variation that can occur in these names (inversion of the names, deletion of a name)
argue for special handling of the Hispanic name search process.

The most important aspect of this specialized Hispanic name search algorithm is an
efficient High Frequency Name Processor. Retrieval of fewer records for evaluation,
yet ones that reflect some variation, is the goal of the High Frequency Processor.

The High Frequency Processor (HFP) of the HNA-E system targets the efficient 
processing of the most frequently occurring records in the Hispanic portion of the 
database. Early attempts at developing a processor that would handle high frequency 
Hispanic names had several major weaknesses. 
• The earlier processors did not adequately address the characteristics of Hispanic
names. In the name of performance, they did not allow for any variation in high 
frequency names. 

• There was only one access method to the high frequency processor, which
eliminated the processing of names similar to the high frequency names by the
High Frequency Processor.

• Strict, often unmotivated, limitations were placed on the high frequency retrieval 
process. Little to no spelling or syntactic variation was permitted. 

• The number of records retrieved was often extremely high, which resulted in a
significant amount of post-processing.

All of these issues have been addressed in the HNA-E design. The HFP will be
primarily list-based but the lists are empirically developed. It will identify and store
relevant information about names, variants and their degree of proximity and will
apply record similarity criteria before retrieval.

Low frequency Hispanic names, on the other hand, carry more information value 
because they are less usual. However, even low frequency names occur with 
sufficient frequency to challenge the system; the Hispanic database in general is very 
large. Preprocessing low frequency names will, therefore, also help reduce the 
number of records retrieved by limiting the search criteria. 

1.2. Scope 

The HNA-E system is intended to provide special and unique handling for names
identified as Hispanic by the Advanced Name Classifier (ANC-E). It addresses the
problem of highly frequent names to maximize retrieval potential and minimize the 
impact on performance and handles less frequently occurring names differently to 
accommodate the greater information content in these names. It also allows for 
broad variation in low frequency names and identifies potentially relevant records 
before database retrieval. 

The input into the HNA-E system will be the output of the Advanced Name 
Classifier (ANC-E). ANC-E will determine if a name is Hispanic and therefore will 
undergo special processing by the Hispanic Name Algorithm (HNA-E). The design 
description of the ANC-E is contained as Attachment A in LAS Linguistic 
Memorandum CT970044 (May 30, 1997). 

It became clear during the research for this design that the data stores that would be 
seminal to this system were very large. The Low Frequency Surname Type Data 
Store, for example, has over 90,000 records in it. Well over 37,000 of these names 
occur one time in the database; many of these are obvious misspellings or truncations 
of names. That is, the character strings do not occur in Spanish: RRODRIGUEZ, for 
example. It is suggested that a program of data stewardship be initiated to increase 
the efficiency of the system and reduce the storage needed for deviant material. One 
method of introducing data stewardship at this juncture would be to introduce Base 
Records for the database records with errors and make the current database 
the new Base Record. 



1.3. Definitions and Acronyms 



ANC-E             Advanced Name Classifier for CLASS-E 
DELETE            The name segment is completely disregarded in the remainder of the 
                  name search process and contributes minimal information to the 
                  record evaluation process; do not remove the segment from the record. 
DISREGARD         The name segment is disregarded in the remainder of the name search 
                  process but contributes to the evaluation of the name in the record 
                  evaluation process; do not remove the segment from the record. 
DI_KEY            Digraph Key 
DI_VAL            Digraph Value: decimal indicating digraph relation of two comparands 
F                 Female 
FNU               First Name Unknown 
FPD               Frequency Path Director 
FTI               Frequency Type Identifier 
GN                Given Name 
GNDR              Gender 
GNTHR             Given Name Threshold (filter qualification) 
GN_INIT           Given Name Initial Key 
GN_VAL            Final Given Name Value 
HCD               Hispanic Character Data Store 
HDM               Hispanic Decision Matrix Data Store 
HFGN_KEY          High Frequency Given Name Key (SET_ID of the GN_TYPE) 
HFGV              High Frequency Given Name Variant Data Store 
HFGN_VAR          High Frequency Given Name Variant Key 
HF                High Frequency 
HFP               High Frequency Processor 
HFS               Hispanic Filter and Sorter 
HFSN_KEY          High Frequency Surname Key (SET_ID of the HFSN_TYPE) 
HFSN_VAR          High Frequency Surname Variant Key (ID_NO of the HFSN_VAR) 
HFST              High Frequency Surname Type Data Store 
HFSV              High Frequency Surname Variant Data Store 
HGI               Hispanic Gender Identifier 
HGT               Hispanic Given Name Type Data Store 
HNA-E             Hispanic Name Search Algorithm for CLASS-E 
HNF               Hispanic Name Formatter 
HNP               Hispanic Name Preprocessor 
HNT               Hispanic Given Name Type Data Store 
HPD               Hispanic Parameter Data Store 
HR                Hispanic Regularization Rule Base 
HRE               Hispanic Rule Engine 
HSE               Hispanic Search Engine 
HSP               Hispanic Segment Positioner 
HSS               Hispanic Surname Segmenter 
HTD               Hispanic TAQ Data Store 
HTP               Hispanic TAQ Processor 
ID_NO             Identification Number for Segments in Data Stores 
INITGN            Given Name Initial Parameter Value 
INITNM            No Match Initial Parameter Value 
INITSN            Surname Initial Parameter Value 
LFDIKEY           Low Frequency Digraph Key in LFST 
LFGT              Low Frequency Given Name Type Data Store 
LFP               Low Frequency Name Processor 
LFST              Low Frequency Surname Type Data Store 
LF DI THRESHOLD   Low Frequency Digraph Threshold 
LNU               Last Name Unknown 
LTF               Linguistic Trace Facility 
M                 Male 
NLD               Name Length Determiner 
REMOVE            A segment that is conjoined to the name stem is removed from the 
                  stem; it will then be marked for additional handling, DELETE or 
                  DISREGARD. 
RGNDR             Record Gender 
RL#               Refusal Code Level Category Number 
SET_ID            Identification Number for Related Set of Name Variants 
SEGMENT           Name element surrounded by white space 
SN                Surname 
SNTHR             Surname Threshold (filter qualification) 
SN_INIT           Surname Initial Key 
SN_VAL            
SPI               Segment Position Identifier 
TAQ               Title/Affix/Qualifier 
TAQDIS#1          TAQ DISREGARD Comparand #1 
TAQDIS#2          TAQ DISREGARD Comparand #2 
U                 Unknown/Ambiguous Name Gender 
YOB               Year-of-Birth 
YOB#              Year-of-Birth Range Category Number 
YR                Year-of-Birth Range Data Store 



2. PROCESS FLOW 

A Hispanic name is pre-processed and prepared for key generation. Prefixes are 
removed, certain name segments are moved, record gender is determined and other 
name characteristics are collected. 

The processor to which a name is submitted is dependent on the frequency of the 
surname, high frequency or low frequency. There are multiple entries into the High 
Frequency Processor, which means that low frequency names that are related to high 
frequency names can also be treated as high frequency names. 

The underlying principle behind the handling of high frequency names is that they 
retrieve a specified set of variants, all of which have pre-determined digraph values 
associated with them. This places the processing burden on adding records to the 
system and reduces the burden at the time of the query. Record retrieval criteria have 
been defined according to the values of the names and their relative positions in the 
query string; a query with high frequency names will, therefore, retrieve a smaller set 
of relevant names. The goal is to retrieve an adequate range of names as rapidly as 
possible. 

Variants of low frequency names will be identified before retrieval based on matching 
digraph keys. The system will then retrieve exact matches on the set of low frequency 
names that pass a low frequency threshold. 

3. MODULE DECOMPOSITION 

3.1. HISPANIC NAME SEARCH ALGORITHM FOR CLASS-E MODULE 
DECOMPOSITION 

3.1.1. Identification 

This program is known as the Hispanic Name Search Algorithm for CLASS-E 
(HNA-E). 

3.1.2. Type 

This program is a subprogram of the CLASS-E system and will process 
Hispanic names for both queries and record adds. 

3.1.3. Purpose 

HNA-E will process input names identified as Hispanic by the ANC-E using 
techniques that are appropriate for Hispanic names. No names with Last 
Name Unknown (LNU) will be processed by HNA-E. 

3.1.4. Function 

The Hispanic Name Search Algorithm for CLASS-E (HNA-E) consists of three 
program modules: 

• the Hispanic Name Preprocessor (HNP), 

• the Hispanic Search Engine (HSE), and 

• the Hispanic Filter and Sorter (HFS). 

3.1.4.1. The HNP will manipulate an input name to generate search keys, 
generate additional query forms or alias record adds, calculate record 
gender, collect information about the input name and its name segments 
and determine the frequency path to which a name will be submitted for 
processing. 

3.1.4.1.1. The HNP will pass an input name to one of two processing 
paths: 

• the High Frequency Name Processor (HFP) or 

• the Low Frequency Name Processor (LFP). 

3.1.4.1.2. The HNP will generate a set of record criteria and search 
keys for retrieval of records from the database. 

3.1.4.2. The HSE will build the retrieval keys, extract record information 
relevant to the retrieval and retrieve database records according to the 
keys and criteria identified. 

3.1.4.3. The HFS will evaluate the database records and will prepare an 
ordered set of records for return to the user. 

3.1.4.3.1. The HFS will qualify records based on filtering criteria and 
parameters. 

3.1.4.3.2. The HFS will sort the qualifying database records into an 
ordered list with the names most closely proximate to the 
query name at the top. 

3.1.5. Subordinates 

HNA-E consists of 3 major programming modules: (See Pages 7-10 for 
graphic representations of the processing flow of these modules.) 

• Hispanic Name Preprocessor (HNP), 

• Hispanic Search Engine (HSE), and 

• Hispanic Filter and Sorter (HFS). 

3.2. HISPANIC NAME PREPROCESSOR MODULE DECOMPOSITION 

3.2.1. Identification 

This module is known as the Hispanic Name Preprocessor (HNP). 

3.2.2. Type 

The HNP is a subprogram of the HNA-E program that accepts input from the 
Advanced Name Classifier for CLASS-E (ANC-E) and prepares it for handling 
by the Hispanic Search Engine (HSE). (See Section 3.13.) 

3.2.3. Purpose 

Hispanic names account for almost 50% of the VLDB name records. In 
addition to the volume of occurrence, there are many names that occur very 
frequently. The format of Hispanic names contributes further obstacles to 
name searching: the surname generally consists of two names and the given 
name generally consists of two names. The most frequently occurring 
prefix in the VLDB is also Hispanic: DE. The frequency, density and the 
nature of the name argue for preparing the name in whatever way(s) are 
necessary to expedite the retrieval process. That is the function of the HNP. 

3.2.4. Function 

3.2.4.1. The HNP will prepare a name identified as Hispanic by the ANC-E for 
the HSE by 

• identifying name segments and determining their disposition, 

• manipulating the name segments to generate additional query 
formats, 

• determining name length and record gender, 

• specifying the frequency character of each name segment and 

• generating search keys. 

3.2.4.2. Because of the significant amount of information that is to be 
generated and collected about the name through the HNP, it is strongly 
recommended that the name be treated as an object that "knows" what 
sorts of information it needs. Such an object will provide a mechanism 
for following the acquisition of information as the object passes through 
the system. Much of that information will be collected and loaded 
during the HNP stage. 

3.2.5. Subordinates 

The HNP has ten subordinate functions: 

• Name Length Determiner (NLD) 

• Hispanic Surname Segmenter (HSS) 

• Hispanic TAQ Processor (HTP) 

• Hispanic Segment Positioner (HSP) 

• Segment Position Identifier (SPI) 

• Hispanic Name Formatter (HNF) 

• Hispanic Gender Identifier (HGI) 

• Frequency Path Director (FPD) 

• High Frequency Name Processor (HFP) 

• Low Frequency Name Processor (LFP) 



3.3. NAME LENGTH DETERMINER MODULE DECOMPOSITION 

3.3.1. Identification 

This function is known as the Name Length Determiner (NLD). 

3.3.2. Type 

The NLD is a function that accepts as input a surname (SN) segment and stores 
the surname length. The length will be used by the Hispanic Surname 
Segmenter (Section 3.4). 

3.3.3. Purpose 

Name segment length will provide information that will be used by the 
Hispanic Surname Segmenter to attempt to divide surnames over a specific 
length into component segments. 

3.3.4. Function 

3.3.4.1. The NLD will accept as input each SN segment. 

3.3.4.1.1. A segment is a string of characters surrounded by white 
space. 

3.3.4.1.2. The NLD will count the number of characters in the SN 
segment (not including surrounding blanks). 

3.3.4.1.3. The NLD will store the length count associated with each 
SN segment. 

3.3.5. Subordinates 
None. 

3.4. HISPANIC SURNAME SEGMENTER MODULE DECOMPOSITION 

3.4.1. Identification 

This function is known as the Hispanic Surname Segmenter (HSS). 

3.4.2. Type 

The HSS attempts to divide surnames over a specified length into component 
segments. The HSS is a function that must follow the NLD and precede the 
Hispanic TAQ Processor (Section 3.5). 

3.4.3. Purpose 

Hispanic names often have many segments and these segments may be quite 
long. Field lengths of fixed size may not be able to accommodate the number 
of name segments that occur. Data entry operators often attempt to reduce the 
name length by conjoining name segments. Conjoined segments have an 
especially negative impact on the surname. The access point into the database 
is through the surname and conjoined name segments generally make the 
component segments inaccessible to processing. Separating conjoined 
surnames would, therefore, improve the search process. 

3.4.4. Function 

3.4.4.1. The HSS will separate conjoined HF SN segments from a surname 
segment of nine characters or more in length. 

3.4.4.1.1. The HSS will generate additional query records for the 
separated SN segment and tag the items separated. 

3.4.4.1.2. The HSS will generate alias record adds for the separated SN 
name segments. 

3.4.4.2. The HSS will access the High Frequency Surname Type Data Store 
(HFST). 

3.4.4.2.1. Phase 1: The HSS will begin with the leftmost character of 
the query/add SN segment and attempt to identify a 
HFSN_TYPE within the input SN string. 

3.4.4.2.2. The HSS will choose the longest HFSN_TYPE that it can 
identify, separate that string from the input string and proceed 
to Phase 2. 

3.4.4.2.3. Phase 2: The HSS will begin with the rightmost character 
of the query/add SN segment and attempt to identify a 
HFSN_TYPE (in reverse order) within the remaining input 
string (after any HFSN_TYPE has been removed during Phase 
1). 

3.4.4.2.4. The HSS will choose the longest HFSN_TYPE that it can 
identify and separate that string from the remaining input 
string. 

3.4.4.2.5. Any residual segment will be retained as is. 

3.4.4.2.6. If no HFSN_TYPE can be identified in either Phase, no 
action will be taken. 

3.4.4.2.7. An alias (or additional query) will be generated for the 
divided string. 



Figure 1: Example: HSS 

INPUT NAME        HFSN_TYPE        PHASE 1    PHASE 2    OUTPUT 
GARCIAGOMEZ       GARCIA, GOMEZ    GARCIA     GOMEZ      GARCIA GOMEZ 
PEREZDELOPEZ      PEREZ, LOPEZ     PEREZ      DELOPEZ    PEREZ DE LOPEZ 
BOMEZDEPEREZ      PEREZ            BOMEZDE    PEREZ      BOMEZDE PEREZ 
RAMIREZDELAPAZ    RAMIREZ, PAZ     RAMIREZ    DELAPAZ    RAMIREZ DELA PAZ 

3.4.5. Subordinates 
None. 
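
The two-phase segmentation in 3.4.4 can be sketched as follows. This is a minimal illustration only, not the CLASS-E implementation: the function name `split_conjoined_surname`, the sample set `HF_SURNAME_TYPES` (standing in for the HFST) and the `min_length` parameter are assumptions made for the example.

```python
# Hypothetical stand-in for the High Frequency Surname Type Data Store (HFST).
HF_SURNAME_TYPES = {"GARCIA", "GOMEZ", "PEREZ", "LOPEZ", "RAMIREZ", "PAZ"}


def split_conjoined_surname(segment, hf_types, min_length=9):
    """Split a conjoined surname segment into its component parts.

    Phase 1 takes the longest HF surname type found at the left edge.
    Phase 2 takes the longest HF surname type found at the right edge of
    what remains. Any residual text between the two is retained as is.
    """
    if len(segment) < min_length:
        return [segment]

    # Phase 1: longest HFSN_TYPE that starts the segment (may be none).
    prefix = max((t for t in hf_types if t and segment.startswith(t)),
                 key=len, default="")
    remaining = segment[len(prefix):]

    # Phase 2: longest HFSN_TYPE that ends the remaining string (may be none).
    suffix = max((t for t in hf_types if t and remaining.endswith(t)),
                 key=len, default="")
    residual = remaining[:len(remaining) - len(suffix)] if suffix else remaining

    # If neither phase found anything, this is just [segment]: no action taken.
    return [part for part in (prefix, residual, suffix) if part]


if __name__ == "__main__":
    for name in ("GARCIAGOMEZ", "PEREZDELOPEZ", "BOMEZDEPEREZ", "RAMIREZDELAPAZ"):
        print(name, "->", " ".join(split_conjoined_surname(name, HF_SURNAME_TYPES)))
```

The sketch returns only the divided string; generating the alias record add or additional query from it is left out.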

3.5. HISPANIC TITLE/AFFIX/QUALIFIER (TAQ) PROCESSOR MODULE 
DECOMPOSITION 

3.5.1. Identification 

This module will be known as the Hispanic Title/Affix/Qualifier Processor 
(HTP). 

3.5.2. Type 

The HTP is a process that accepts a full Surname (SN) or Given Name (GN), 
accesses the Hispanic TAQ Data Store and reduces name fields with multiple 
segments to their name stems. 

3.5.3. Purpose 

Hispanic names frequently contain peripheral name elements, such as DE, DE 
LA, DEL, SAN. Matching on these segments is not generally useful; the 
name segments with information value are the name stems. For example, 
GARCIA is the more valuable segment in the string DE GARCIA, as is 
ANGELES in DE LOS ANGELES. Removal of or disregard for the 
peripheral name elements allows more emphasis to be placed on the name 
stems, thus improving the search process. 

3.5.4. Function 

3.5.4.1. The HTP will access the Hispanic TAQ Data Store (HTD) to identify 
TAQ segments: titles (e.g., SR., MR.), affixes (e.g., DE) or qualifiers 
(e.g., PH.D., HIJO). 

3.5.4.1.1. The HTD will contain information about the disposition of 
the TAQ. 

3.5.4.1.2. The HTD will contain information about the type of TAQ 
(TAQ_TYPE): Title, Prefix, Infix, Suffix, Qualifier. 

3.5.4.2. The HTP will scan all SN segments or all GN segments for any TAQ 
segments. 

3.5.4.2.1. The HTP will begin with the leftmost character of the SN or 
GN field and attempt to identify a TAQ segment among the 
SN segments and among the GN segments. (The TAQ 
segment will be surrounded by white space.) 

3.5.4.2.2. If the HTP identifies a segment, it will tag the segment with 
the ID_NO and disposition, as indicated in the HTD. 

3.5.4.2.3. If the following segment is also a TAQ segment, it will tag 
the segment with the ID_NO and disposition, as indicated in 
the HTD. 

3.5.4.2.4. This will continue until all consecutive TAQ segments have 
been tagged. 

3.5.4.2.5. When the HTP encounters a following segment that is not a 
TAQ segment, it will treat that segment as a stem. 

3.5.4.2.5.1. Each TAQ segment identified up to that point will be 
given the TAQ_TYPE P (prefix) and each will be 
associated and stored with the following stem. 

3.5.4.2.6. The HTP will move to the next segment following the stem 
and will repeat the TAQ identification process. 

3.5.4.2.6.1. The HTP will tag all TAQ segments with the ID_NO 
and disposition. 

3.5.4.2.6.2. When the HTP encounters a stem, it will tag each 
TAQ segment (not yet associated with a stem) with the 
TAQ_TYPE P and will associate and store each TAQ 
segment with the following stem. 

3.5.4.2.7. If the HTP encounters a TAQ segment or segments that have no 
following stem, it will access the HTD to determine if the 
TAQ type is a Suffix (S). 

3.5.4.2.7.1. If the TAQ has a TAQ_TYPE S, the TAQ will be 
associated and stored with the preceding stem. 

3.5.4.2.7.2. The preceding stem may already have prefixal 
TAQs. 

3.5.4.2.7.3. If the TAQ type is not equal to S, the TAQ will be 
tagged a Stranded Prefix. 

3.5.4.3. The HTP will process any TAQ segments identified according to the 
treatment indicated in the HTD. 

3.5.4.4. Treatment options include DELETE, DISREGARD and REMOVE. 

3.5.4.4.1. DELETE means that the segment is completely disregarded 
in the remainder of the name search process and contributes 
marginal information to the filtering process. (N.B. The 
segment is not deleted from the record.) 

3.5.4.4.2. DISREGARD means that the segment is disregarded in the 
remainder of the name search process but contributes to the 
evaluation of the name in the filtering processes. 

3.5.4.4.3. REMOVE means that a segment that is conjoined to the 
name stem is removed from that stem. It is then submitted to 
additional handling, either DELETE or DISREGARD. 

3.5.4.4.3.1. The HTP will begin with the leftmost character in 
the input stem segment (after free-standing TAQs have 
been removed) and will attempt to identify all TAQ 
segments that have been marked for removal 
(REMOVE). 

3.5.4.4.3.2. The HTP will begin with the longest TAQ segment 
and attempt to remove that; it will then proceed to 
shorter segments. 

3.5.4.4.3.3. If the segment that is to remain after TAQ removal is 
two characters or fewer, the HTP will not remove the 
TAQ. 

3.5.4.4.3.4. If the TAQ segment is identified and the residual 
stem is of sufficient length, it is separated from the stem. 

3.5.4.4.3.5. The HTP assigns and stores the ID_NO of the 
removed TAQ. 

3.5.4.4.3.6. The HTP then submits the removed TAQ to the 
treatment indicated (DELETE or DISREGARD) in the 
HTD and tags and stores the TAQ with that treatment 
indicator. 



Figure 2: Example: TAQ REMOVE Process 

INPUT STRING    TAQ REMOVE    OUTPUT 
DECORDOBA       DE            DE CORDOBA 
DELOSANGELES    DELOS         DELOS ANGELES 
DEARING         DE            DE ARING 
MARIADE                       MARIADE 
DELPILAR        DEL           DEL PILAR 
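
The REMOVE treatment in 3.5.4.4.3 (leftmost position, longest TAQ first, residual stem of at least three characters) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `remove_conjoined_taq` and the sample set `REMOVABLE_TAQS` are hypothetical, and the real HTD would also supply the ID_NO and the follow-on DELETE or DISREGARD treatment.

```python
# Hypothetical sample of TAQs marked REMOVE in the Hispanic TAQ Data Store (HTD).
REMOVABLE_TAQS = {"DE", "DEL", "DELOS"}


def remove_conjoined_taq(stem, removable_taqs, min_residual=3):
    """Separate a conjoined TAQ from the left edge of a stem.

    Candidates are tried longest first. A TAQ is only removed when the
    residual stem keeps at least `min_residual` characters, that is, when
    more than two characters remain. Returns (taq, residual); taq is None
    when nothing was removed.
    """
    for taq in sorted(removable_taqs, key=len, reverse=True):
        if stem.startswith(taq) and len(stem) - len(taq) >= min_residual:
            return taq, stem[len(taq):]
    return None, stem


if __name__ == "__main__":
    for s in ("DECORDOBA", "DELOSANGELES", "DEARING", "MARIADE", "DELPILAR"):
        print(s, "->", remove_conjoined_taq(s, REMOVABLE_TAQS))
```

Note that a purely string-based rule also splits DEARING into DE and ARING, which is the kind of over-segmentation a list-driven approach has to tolerate or filter later.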



Figure 3: Example: TAQ Processing 

INPUT SN:        DE LA CRUZ DE BARRIOS SAN 

TAQs and STEMS:  TAQ:  DE (ID_NO, REMOVE, DISREGARD) 
                 TAQ:  LA (ID_NO, REMOVE, DISREGARD) 
                 STEM: CRUZ 
                 TAQ:  DE (ID_NO, REMOVE, DISREGARD) 
                 STEM: BARRIOS 
                 TAQ:  SAN (ID_NO, Stranded Prefix) 

OUTPUT SN:       CRUZ BARRIOS 
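
The scan described in 3.5.4.2 (consecutive TAQs attach as prefixes to the next stem; trailing TAQs become suffixes or stranded prefixes) can be sketched as follows. This is a simplified illustration, not the HTP itself: the function name `tag_taqs`, the dictionary `SAMPLE_HTD` and its type codes are assumptions, and ID_NO and disposition tagging are omitted.

```python
# Hypothetical HTD extract: TAQ text -> TAQ_TYPE code ("P" prefix, "S" suffix).
SAMPLE_HTD = {"DE": "P", "LA": "P", "SAN": "P", "HIJO": "S"}


def tag_taqs(segments, htd):
    """Associate TAQ segments with name stems.

    Returns (stems, stranded) where each stem is a dict holding the stem
    text, its prefixal TAQs and any suffix TAQs, and `stranded` lists
    trailing TAQs that are not suffixes (Stranded Prefixes).
    """
    stems = []
    pending = []  # TAQs seen since the last stem, not yet attached
    for segment in segments:
        if segment in htd:
            pending.append(segment)
        else:
            # A non-TAQ segment is a stem; pending TAQs become its prefixes.
            stems.append({"stem": segment, "prefixes": pending, "suffixes": []})
            pending = []

    stranded = []
    for taq in pending:  # TAQs with no following stem
        if htd[taq] == "S" and stems:
            stems[-1]["suffixes"].append(taq)
        else:
            stranded.append(taq)
    return stems, stranded


if __name__ == "__main__":
    stems, stranded = tag_taqs("DE LA CRUZ DE BARRIOS SAN".split(), SAMPLE_HTD)
    print([s["stem"] for s in stems], stranded)
```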

3.5.5. Subordinates 
None. 

3.6. HISPANIC SEGMENT POSITIONER MODULE DECOMPOSITION 

3.6.1. Identification 

This function is known as the Hispanic Segment Positioner (HSP). 

3.6.2. Type 

The HSP is a function that moves a high frequency (HF) surname (SN) found 
in the given name (GN) field into the SN field. 

3.6.3. Purpose 

Surnames that occur in the GN field deprive the match process of relevant SN 
information. Moving a SN segment that occurs in the GN field to the SN field 
will benefit the search process. (The SN segment is moved to the rightmost 
position to retain the value assigned to the resident SN segment(s).) 

3.6.4. Function 

3.6.4.1. If more than one GN segment (stem) occurs in the GN field, the HSP 
will determine if the final (rightmost) segment in the GN string is a HF 
SN. 

3.6.4.2. The HSP will move the segment to the SN field. 

3.6.4.2.1. If more than one GN segment occurs in the GN field, the 
HSP will access the High Frequency Surname Type Data 
Store (HFST) to determine if the rightmost GN segment is a 
HFSN_TYPE. 

3.6.4.2.2. If the segment is a HFSN_TYPE, the HSP will move the 
segment into the rightmost position of the SN field. 

3.6.4.3. The process applies to one name segment only and is not iterative. 



INPUT NAME                      HFSN_TYPE    OUTPUT FORMAT 
CASTRO, MARIA LUZ GOMEZ         GOMEZ        CASTRO GOMEZ, MARIA LUZ 
BARRIOS LUNA, JUAN PEREZ        PEREZ        BARRIOS LUNA PEREZ, JUAN 
LOPES ARRIAGA, CARLOS VITRAL                 LOPES ARRIAGA, CARLOS VITRAL 



3.6.4.4. An additional query record is generated with the moved segment; the 
original record is not changed. 

3.6.4.5. An alias record add is generated with the moved segment; the original 
record is not changed. 

3.6.5. Subordinates 
None. 
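
The repositioning rule in 3.6.4 can be sketched as follows. This is a minimal sketch: the function name `reposition_surname` and the sample set `HF_SURNAMES` (standing in for the HFST) are assumptions, and the function returns new lists so that the original record is left unchanged, as 3.6.4.4 and 3.6.4.5 require.

```python
# Hypothetical stand-in for the High Frequency Surname Type Data Store (HFST).
HF_SURNAMES = {"GARCIA", "GOMEZ", "PEREZ", "LOPEZ"}


def reposition_surname(surnames, given_names, hf_types):
    """Move a high frequency surname found at the end of the GN field.

    Applies once, to the rightmost GN segment only, and only when the GN
    field holds more than one segment. The moved segment goes to the
    rightmost position of the SN field.
    """
    if len(given_names) > 1 and given_names[-1] in hf_types:
        return surnames + [given_names[-1]], given_names[:-1]
    return list(surnames), list(given_names)


if __name__ == "__main__":
    print(reposition_surname(["CASTRO"], ["MARIA", "LUZ", "GOMEZ"], HF_SURNAMES))
```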

3.7. HISPANIC NAME FORMATTER MODULE DECOMPOSITION 

3.7.1. Identification 

This module is known as the Hispanic Name Formatter (HNF). 

3.7.2. Type 

3.7.2.1. The HNF is a process that generates additional name formats for 
input records that have more than two surname stems. 

3.7.2.2. The HNF will follow the HSS, HTP, and HSP. 

3.7.2.3. The generated formats will serve as the name format for HF name 
processing and for comparison in the filtering and sorting process. 

3.7.3. Purpose 

The HNF will limit the number of segments that can occur in the surname 
field to two in order to maximize the efficient processing of the input name. 

3.7.4. Function 

3.7.5. The HNF will generate additional alias record adds and queries for 
surnames that contain more than two SN stems. 

3.7.5.1. The HNF will accept input strings with any number of SN segments 
(stems). 

3.7.5.2. When more than two SN segments are present, the HNF will generate 
additional name formats with a limit of two SN segments. 

3.7.5.2.1. The HNF will begin with the leftmost SN segment and 
generate dual-SN formats with each additional SN segment. 

3.7.5.2.2. The HNF will move to the second SN segment and generate 
dual-SN formats with each remaining SN segment, covering the 
pairs that have not yet been generated. 

3.7.5.2.3. The relative order of all segments will be maintained. 

3.7.5.3. All generated formats will be stored with the record add. 

3.7.5.4. All generated formats will be additional queries. 



Figure 5: Example: Hispanic Name Formatter (HNF) 

INPUT SURNAME          GARCIA    LUNA    BUSTOS    ARRIAGA 

HNF DUAL-SN FORMATS    GARCIA    LUNA 
                       GARCIA            BUSTOS 
                       GARCIA                      ARRIAGA 
                                 LUNA    BUSTOS 
                                 LUNA              ARRIAGA 
                                         BUSTOS    ARRIAGA 
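
The dual-SN generation in Figure 5 amounts to taking every ordered pair of stems while keeping their relative order. A minimal sketch, assuming a hypothetical function name `dual_sn_formats`, uses `itertools.combinations`, which yields pairs in exactly that left-to-right order:

```python
from itertools import combinations


def dual_sn_formats(sn_stems):
    """Generate dual-SN formats for a surname with more than two stems.

    Pairs are produced starting from the leftmost stem, and the relative
    order of the stems is preserved inside each pair. Surnames with two
    stems or fewer need no additional formats, so an empty list is returned.
    """
    if len(sn_stems) <= 2:
        return []
    return list(combinations(sn_stems, 2))


if __name__ == "__main__":
    for pair in dual_sn_formats(["GARCIA", "LUNA", "BUSTOS", "ARRIAGA"]):
        print(" ".join(pair))
```

For n stems this yields n(n-1)/2 formats, so four stems produce six additional queries or alias record adds.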



3.7.6. Subordinates 
None. 

3.8. SEGMENT POSITION IDENTIFIER MODULE DECOMPOSITION 

3.8.1. Identification 

This module is known as the Segment Position Identifier (SPI). 

3.8.2. Type 

The SPI is a function that identifies the relative position of each of the SN and 
GN stems. The SPI must follow the HTP, HSP and HNF. Segment position 
information will be accessed by the High Frequency Processor (HFP) and the 
Hispanic Filter and Sorter (HFS). 

3.8.3. Purpose 

Hispanic names generally contain more than one SN and more than one GN. 
The value of each of these name stems is different. In a SN, the leftmost stem 
is the family name; other SN stems are differentiators. The family name carries 
more value in the SN. In a GN, the leftmost name stem generally indicates 
gender and so is a valuable indicator. Names that are in position and names that 
are out of position are therefore of differing relevance. Position information can 
contribute to the selection and evaluation of relevant records. 

3.8.4. Function 

3.8.4.1. The SPI will operate on any SN or GN except where dual-SN formats 
have been generated. 

3.8.4.1.1. Where dual-SN formats have been created, the SPI will 
accept only those formats. 

3.8.4.1.2. The SPI will accept any number of GN segments. 

3.8.4.2. The SPI will specify the position in the name field (SN or GN fields) 
of each name segment. 

3.8.4.3. The SPI will begin with the leftmost segment and assign Position 1, 
proceed to the next segment and assign Position 2, and so forth. 

3.8.4.4. Position information will be generated for and stored with each SN 
segment. 

3.8.4.5. Position information will be generated for and stored with each GN 
segment. 

3.8.5. Subordinates 
None. 

3.9. HISPANIC GENDER IDENTIFIER MODULE DECOMPOSITION 

3.9.1. Identification 

This function is known as the Hispanic Gender Identifier (HGI). 

3.9.2. Type 

This is a function that associates a gender value with a record; it will be 
accessed by the Hispanic Decision Matrix (HDM) and the Hispanic Filter and 
Sorter (HFS). 

3.9.3. Purpose 

It is usually possible to predict the gender of a Hispanic name based on the 
gender marker of the leftmost given name segment. Because crossed-gender 
names are of little value in the visa adjudication process, lowering the value of 
a record whose gender does not match that of a query would improve the name 
matching process. 

Predicting gender based on one source of gender, however, may result in 
elimination of records that differ by one character only. More than one source 
of gender information can provide a means of validating the gender 
assignment. This will be the record gender. Record gender will reduce the 
chance of qualifying or disqualifying a record based on the gender of a single 
name segment, which could be misspelled or ambiguous with respect to 
gender. 

3.9.4. Function 

3.9.4.1. The HGI will derive a gender that will be associated with a record 
and not a Given Name stem alone. 

3.9.4.2. A record gender value may be Male (M), Female (F), or 
Unknown/Ambiguous (U) Gender. 

3.9.4.2.1. The HGI will derive the record gender from the GN gender 
associated with each GN segment and the gender provided by 
the user during the data entry process. 

3.9.4.2.1.1. The HGI will access the Hispanic Given Name Type 
Data Store (HGT) to determine the gender associated 
with each GN segment. 

3.9.4.2.1.1.1. If the name is present in the HGT, the 
gender indicated will be associated with the GN 
segment. 

3.9.4.2.1.1.2. If the name is not present in the HGT, the 
record gender will be marked as Unknown (U). 
(This would occur for a query with a name never 
before submitted to the system.) 

3.9.4.2.1.2. The applicant gender is determined at the time of 
application and must be entered, captured and stored by 
the system. 

3.9.4.2.2. The HGI will verify that all gender indicators agree: the 
gender associated with each GN segment and the applicant 
gender received at the time of application. 

3.9.4.2.2.1. To mark the record gender as M or F, the HGI 
requires gender validation from a minimum of two 
sources. 

3.9.4.2.2.2. All sources of gender information (whether two or 
more) must match for gender to be marked as M or F. 

3.9.4.2.2.2.1. If the gender indicators match, the match 
value will become the record gender. 

3.9.4.2.2.2.2. If the gender indicators do not match, 
gender is marked as U. 



Figure 6: Example: Record Gender Assignment 

GIVEN NAME                HGT GNDR        INPUT GENDER    RECORD GENDER 
1) MARIA                  F 
   LUZ                    F 
2) JOSE                   M               M               M 
   ANTONIO                M 
3) CARLOS                 M               M               U 
   (DELA) CRUZ            U 
4) BERNARDO               M               M               M 
5) CAMEN (misspelling)    (not in HGT)    F               U 
   MARIA                  F 







3.9.5. Subordinates 
None. 
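
The record gender rule in 3.9.4 (at least two sources, all of which must agree, with any name missing from the HGT forcing U) can be sketched as follows. This is an illustration only: the function name `record_gender`, the dictionary `SAMPLE_HGT` and the use of `None` for a missing applicant gender are assumptions.

```python
# Hypothetical extract of the Hispanic Given Name Type Data Store (HGT).
SAMPLE_HGT = {"MARIA": "F", "LUZ": "F", "JOSE": "M", "ANTONIO": "M",
              "CARLOS": "M", "CRUZ": "U", "BERNARDO": "M"}


def record_gender(given_names, applicant_gender, hgt):
    """Derive the record gender: "M", "F" or "U" (unknown/ambiguous).

    Sources are the HGT gender of each given name stem plus the applicant
    gender captured at data entry. M or F is assigned only when there are
    at least two sources and every source agrees.
    """
    sources = []
    for name in given_names:
        if name not in hgt:
            return "U"  # name never seen before: record gender is Unknown
        sources.append(hgt[name])
    if applicant_gender is not None:
        sources.append(applicant_gender)

    if len(sources) >= 2 and len(set(sources)) == 1 and sources[0] in ("M", "F"):
        return sources[0]
    return "U"


if __name__ == "__main__":
    print(record_gender(["JOSE", "ANTONIO"], "M", SAMPLE_HGT))
    print(record_gender(["CARLOS", "CRUZ"], "M", SAMPLE_HGT))
```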



3.10. FREQUENCY PATH DIRECTOR MODULE DESCRIPTION 

3.10.1. Identification 

This module is known as the Frequency Path Director (FPD). 

3.10.2. Type 

3.10.2.1. The FPD directs a record to the High Frequency Processor or Low 
Frequency Processor depending on the presence or absence of HF 
surnames in the string. 

3.10.2.2. The FPD will access the following data stores: 

• High Frequency Surname Type Data Store (HFST) 

• Hispanic Character Data Store (HCD) 

3.10.3. Purpose 

Many Hispanic names occur with such high frequency that they would benefit 
from special processing. The system must determine which surnames are high 
frequency surnames and direct records with high frequency surnames to 
the proper handler. 

3.10.4. Function 

3.10.4.1. The FPD will accept any SN format, except where dual-SN formats 
have been generated by the Hispanic Name Formatter (HNF). 

3.10.4.1.1. The FPD will operate on the dual-SN formats where they 
have been generated. 

3.10.4.2. The FPD will identify, process and assign keys to SN initials. 

3.10.4.3. The FPD will identify and tag each SN stem as HF or LF. 

3.10.4.4. The FPD will assign HFSN_KEYs, where appropriate. 

3.10.4.5. The FPD will direct the record to the HF Processor or LF Processor 
depending on the frequency tags of the SN segments. 

3.10.4.6. The frequency-identification process will repeat until the frequency 
value of all SN segments has been identified. 

3.10.4.7. Surname Initials 

3.10.4.8. Record Adds 

3.10.4.9. The FPD will generate a SN_INIT Key for the initial character of 
each SN segment. (The SN segment may be an initial.) 

3.10.4.9.1. The FPD will access the Hispanic Character Data Store 
(HCD) to identify the SN_INIT Key. 

3.10.4.9.2. The FPD will find the initial character in the CHAR list. 

3.10.4.9.3. The FPD will assign the SN_INIT Key to the character. 

3.10.4.9.4. The SN_INIT Key will be the SET_ID for all occurrences 
of the character. 

3.10.4.9.5. The FPD will store the SN_INIT Key with the SN segment 
of the record. 

3.10.4.10. Query 

3.10.4.11. The FPD will identify single characters that occur in the SN field; 
any segment that has a name length of 1 (as specified by the Name 
Length Determiner (NLD)) is an initial. 

3.10.4.12. The FPD will access the Hispanic Character Data Store (HCD) to 
determine the SN_INIT Key(s) to assign to the segment. 

3.10.4.12.1. The FPD will find each instance of the character in the 
CHAR_VAR list. 

3.10.4.12.2. The FPD will assign SN_INIT Key(s) to the SN initial. 

3.10.4.12.3. The SN_INIT Key is the SET_ID associated with each 
instance of the initial. 

3.10.4.12.4. The SN initial may have multiple SN_INIT Keys. 

3.10.4.13. The FPD will ignore the SN_INIT Keys when determining the 
frequency path assignment of a record; the assignment will be based on 
the frequency of the other SN segment. 

3.10.4.14. High Frequency Surnames 

3.10.4.15. The FPD will access the High Frequency Surname Type Data 
Store (HFST). 

3.10.4.16. If a SN segment matches exactly a HFSN_TYPE in the HFST, the 
segment will be given the HFSN_KEY associated with the 
HFSN_TYPE. 

3.10.4.16.1. Record Add/Query: The HFSN_KEY will be the 
SET_ID associated with the HFSN_TYPE in the HFST. 

3.10.4.16.2. Record Add: A digraph value (DI_VAL) of 1.00 will be 
assigned to and stored with the segment that matches a 
HFSN_TYPE exactly. 

3.10.4.16.3. The HFSN_KEY will represent a set of name segments 
that have qualified as digraph variants of the HFSN_TYPE. 
(See 3.12.4.38 for information on how the variants are 
assigned to the same set.) 

3.10.4.17. The FPD will direct records that contain all HFSN_KEYs to the 
High Frequency Processor (HFP). 



INPUT NAME: GARCIA LOPEZ, ANTONIO JESUS 

          FIELD   HFSN_KEY   DI_VAL 
GARCIA    SN      0001       1.00 
LOPEZ     SN      0004       1.00 


ID_NO   HFSN_TYPE   SET_ID 
0001    GARCIA      0001 
0002    RODRIGUEZ   0002 
0003    HERNANDEZ   0003 
0004    LOPEZ       0004 
0005    MARTINEZ    0005 
0006    GONZALEZ    0006 
0007    PEREZ       0007 
0008    SANCHEZ     0008 
0009    RAMIREZ     0009 
0010    GOMEZ       0010 
0011                0011 

3.10.4.18. The FPD will direct all records that do not contain all 
HFSN_KEYs to the Low Frequency Processor (LFP). 
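
The exact-match lookup and routing rules in 3.10.4.14 through 3.10.4.18 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the HNA-E implementation: the dictionary HFST holds only the ten named sample rows from the HFST excerpt in this section, and the function names assign_hfsn_keys and frequency_path are hypothetical. SN_INIT Keys, which 3.10.4.13 says are ignored for path assignment, are not modeled.

```python
# Sample HFSN_TYPE -> SET_ID rows taken from the HFST excerpt (illustrative only).
HFST = {
    "GARCIA": "0001", "RODRIGUEZ": "0002", "HERNANDEZ": "0003",
    "LOPEZ": "0004", "MARTINEZ": "0005", "GONZALEZ": "0006",
    "PEREZ": "0007", "SANCHEZ": "0008", "RAMIREZ": "0009",
    "GOMEZ": "0010",
}


def assign_hfsn_keys(sn_segments):
    """Return (segment, HFSN_KEY, DI_VAL) for each SN segment.

    An exact match in the HFST yields the SET_ID as HFSN_KEY and a
    DI_VAL of 1.00; a segment with no exact match gets (None, None).
    """
    result = []
    for segment in sn_segments:
        key = HFST.get(segment)
        result.append((segment, key, 1.00 if key is not None else None))
    return result


def frequency_path(sn_segments):
    """Route to "HFP" only when every SN segment received an HFSN_KEY."""
    keyed = assign_hfsn_keys(sn_segments)
    all_hf = bool(keyed) and all(key is not None for _, key, _ in keyed)
    return "HFP" if all_hf else "LFP"


if __name__ == "__main__":
    print(assign_hfsn_keys(["GARCIA", "LOPEZ"]))
    print(frequency_path(["GARCIA", "LOPEZ"]))
    print(frequency_path(["PEREZ", "BOMEZ"]))
```

With the sample data, GARCIA LOPEZ receives the keys 0001 and 0004 and is routed to the HFP, while PEREZ BOMEZ is routed to the LFP because BOMEZ has no exact match.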

3.10.5. Subordinates 
None. 

3.11. HIGH FREQUENCY PROCESSOR MODULE DECOMPOSITION 

3.11.1. Identification 

This module is known as the High Frequency Processor (HFP). 

3.11.2. Type 

3.11.2.1. The HFP is a program module that 

• will process records with all HFSN_KEYs, HFSN_VAR Keys, 
SN_INIT Keys and mixed HF and LF Keys; 

• will generate Given Name Keys; and 

• will access the Hispanic Decision Matrix Data Store (HDM) to 
identify retrieval criteria for the HF records. 

3.11.2.2. Multiple entry points into the HFP will be supported: through the 
Frequency Path Director and the Low Frequency Processor (LFP). 

3.11.3. Purpose 

Earlier attempts to develop a HF handler for Hispanic names have been 
limited to processing of records that contain only HF names; little to no 
variation was permitted. HNA-E will support variation in the processing of 
HF names by allowing multiple entry points into the HFP. 



HNA-E    Language Analysis Systems, Inc.    24    03/19/98 



3.11.4. Function 

3.11.4.1. The HFP will accept names directed to the processor by the 
Frequency Path Director (FPD) and by the LFP. 

3.11.4.2. The records accepted will contain all HFSN_KEYs and/or 
HFSN_VAR Keys, SN_INIT Keys, and mixed 
HFSN_KEY/HFSN_VAR and DI_KEYs. 

3.11.4.3. All SN_INIT Keys passed to the HFP will be treated as segment 
Keys and will undergo the same criteria identification as other segments. 

3.11.4.4. If all segments of the SN Field have been given HFSN_KEYs and/or 
HFSN_VAR Keys and related DI_VALs, the HFP will begin processing 
the GN segments. 

3.11.4.5. Processing the Given Name Segments 

3.11.4.6. If the GN segment is First Name Unknown (FNU), no GN 
processing will take place. 

3.11.4.7. High Frequency Given Name Segment Keys 

3.11.4.8. Record Adds 

3.11.4.9. The HFP will access the Hispanic Given Name Variant Data Store 
(HGNV) to determine if the GN segments are HF GN segments. 

3.11.4.9.1. If the GN segment matches one or more variants in the 
HGNV, the HFP will assign the HFGN_KEY to the GN 
segment. 

3.11.4.9.2. The HFGN_KEY is (are) the SET_ID(s) associated with 
the variant. 

3.11.4.9.3. The HFP will associate the appropriate DI_VAL with the 
SET_ID and GN segment. 

3.11.4.9.4. The HFP will store the SET_ID(s) and their DI_VAL with 
the GN segment. 

3.11.4.9.5. The HFGN_KEY ensures that the system will retrieve 
variants of a HF segment when the HF segment is queried. 

3.11.4.10. Query 

3.11.4.11. The HFP will access the Hispanic Given Name Type Data Store 
(HGT). 

3.11.4.11.1. If a GN segment matches exactly a GN_TYPE name 
segment in the HGT and HI_FREQ = 1 (is True) (that is, the 
segment is a HF GN_TYPE segment), the HFP will assign to 
the GN segment the HFGN_KEY associated with the 
GN_TYPE. 

3.11.4.11.2. The HFGN_KEY will be the SET_ID associated with the 
HF GN_TYPE. 

3.11.4.12. High Frequency Given Name Initial Keys 

3.11.4.13. The HFP will create one or more GN_INIT Keys for each HF GN 
segment. 

3.11.4.14. The GN_INIT Key will be the initial key for each GN segment, 
including initials. 

3.11.4.14.1. Record Add 

3.11.4.14.2. The HFP will identify the initial character of each GN 
segment. 

3.11.4.14.3. The HFP will access the Hispanic Character Data Store 
(HCD) and will find all occurrences of the character in the 
CHAR_VAR list. 

3.11.4.14.4. The HFP will assign the GN_INIT Key(s) to each GN 
initial. 

3.11.4.14.4.1. The GN_INIT Key will be the SET_ID(s) 
associated with the GN initial (CHAR_VAR). 

3.11.4.14.4.2. The GN segment may have multiple GN_INIT 
Keys. 

3.11.4.14.4.3. The GN_INIT Key will permit retrieval of 
multiple initials for a GN initial. 

3.11.4.14.4.4. The HFP will store the GN_INIT Key(s) for each 
GN segment initial with the record. 

3.11.4.14.5. Query 

3.11.4.14.6. The HFP will access the HCD and find the initial in the 
CHAR list. 

3.11.4.14.7. The HFP will identify the GN_INIT Key for each GN 
segment initial. 

3.11.4.15. If the GN segment is not a variant in the HGNV, the HFP will tag 
the name as LF. 

3.11.4.16. Low Frequency Given Name Segment Keys 

3.11.4.17. For each LF GN segment, the HFP will attempt to determine if the 
segment is a potential variant of a HF GN_TYPE and will create one or 
more GN_INIT Keys for record adds and queries. 

3.11.4.18. Record Add 

3.11.4.19. If the LF GN is not in the HGNV Data Store, the HFP will 
determine if the segment is a potential variant of a HFGN_TYPE. (This 
would apply to LF GN segments that are being submitted to the system 
for the first time.) 

3.11.4.19.1. If the HFP determines that the HFGN segment is a 
variant of a HFGN_TYPE, the LFP will append the segment 
to the HFGV Data Store. 

3.11.4.19.2. The HFP will access the Hispanic Given Name Type Data 
Store (HGT) to determine if the LF GN segment is a digraph 
variant of one or more of the HF GN_TYPEs. (That is, the LF 
GN segment is a digraph variant of the GN_TYPE whose HF 
Value is True (1).) 

3.11.4.19.3. The LFP will perform a digraph evaluation of the LF GN 
and each HFGN_TYPE. 

3.11.4.19.4. The digraph value is determined in the following way: 

3.11.4.19.4.1. The digraphs are identified for each segment. 

3.11.4.19.4.2. Each pair of adjacent alphabetic characters is identified: 
CARA -> CA/AR/RA. 

3.11.4.19.4.3. A digraph is also formed of the initial boundary 
(#) and the first alphabetic character: CARA -> #C. 

3.11.4.19.4.4. A digraph is also formed of the final alphabetic 
character and the final boundary (#): CARA -> A#. 

3.11.4.19.4.5. The number of shared digraphs is calculated. 

3.11.4.19.4.5.1. A digraph may match one digraph only. 

3.11.4.19.4.6. The number of shared digraphs is multiplied by 2 
and divided by the sum of the total number of digraphs in 
Comparand #1 and the total number of digraphs in 
Comparand #2. 

3.11.4.19.4.6.1. The formula is: 

2 * d / (a + b), 

where d = the total number of shared 
digraphs; 

where a = the total number of digraphs in 
Comparand #1; and 

where b = the total number of digraphs in 
Comparand #2. 

3.11.4.19.4.7. The result is the Digraph Value (DI_VAL) for the 
two Comparands. 



Figure 9: Example: Digraph Calculation 

COMPARANDS            DIGRAPHS                  SHARED DIGRAPHS   DI_VAL 
COMPARAND #1: CARA    #C CA AR RA A#            #C CA AR A#       2*d/(a + b) = 
                      (5 total digraphs = a)    = 4 (d)           8/12 = 0.67 
COMPARAND #2: CARINA  #C CA AR RI IN NA A# 
                      (7 total digraphs = b) 


3.11.4.19.4.8. This process is performed for each pair of 
Comparands. 
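
The digraph evaluation described in 3.11.4.19.4 can be sketched in Python as follows. This is an illustrative sketch rather than the HNA-E code: the function names digraphs and di_val are hypothetical, and the rule that a digraph may match one digraph only is modeled here as a multiset intersection, which is an assumption about how duplicate digraphs are counted.

```python
from collections import Counter


def digraphs(segment):
    """List the digraphs of a segment, including the # boundary digraphs."""
    padded = "#" + segment.upper() + "#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def di_val(comparand_1, comparand_2):
    """Compute 2 * d / (a + b) for two name segments."""
    first = digraphs(comparand_1)
    second = digraphs(comparand_2)
    # Multiset intersection: each digraph occurrence is matched at most once.
    shared = sum((Counter(first) & Counter(second)).values())
    return 2 * shared / (len(first) + len(second))


if __name__ == "__main__":
    print(digraphs("CARA"))                      # ['#C', 'CA', 'AR', 'RA', 'A#']
    print(round(di_val("CARA", "CARINA"), 2))    # 0.67
```

For CARA and CARINA this gives a = 5, b = 7 and d = 4, so the value is 8/12, which rounds to 0.67 as in Figure 9.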

3.11.4.19.5. To qualify for addition to the HFGV as a variant of one or 
more HFGN_TYPEs, the digraph value must pass a threshold, 
the High Frequency Given Name Variant Threshold (HFGV 
Threshold). 

3.11.4.19.5.1. The HFP will access the Hispanic Parameter Data 
Store (HPD) (Section 4.13) to determine the HFGV 
Threshold that the digraph value must pass for the LF 
GN to be appended to the HFGV Data Store. 

3.11.4.19.5.2. If the LF GN segment qualifies as a digraph variant 
of one or more HF GN_TYPEs, the HFP 

• will append the LF GN to the HFGN_TYPEs to which 
it is related by entering the name into the HFGN_VAR 
list in the HFGV Data Store; 

• will assign the next available ID_NO to the newly 
added HFGN_VAR; 

• will assign to the newly added HFGN_VAR the SET_ID 
that corresponds to the SET_ID of the 
HFGN_TYPE with which the new HFGN_VAR is 
associated; 

• will enter the digraph value into DI_VAL; and 

• will store with the LF GN segment in the record the 
ID_NO(s) of the HFGN_VAR for each entry, the 
SET_ID of each HFGN_TYPE that is the parent of the 
HFGN_VAR, and the digraph value associated with 
each entry. 

3.11.4.20. Whether or not the LF segment is a variant of a HFGN_TYPE, the 
HFP will generate one or more GN_INIT Keys for the LF GN segment. 

3.11.4.20.1. The GN_INIT Key will be the initial key for each GN 
segment, including segments that are initials. 

3.11.4.20.2. If the GN segment is FNU (First Name Unknown), no 
GN_INIT Key will be generated. 

3.11.4.20.3. Record Add 

3.11.4.20.4. The HFP will identify the initial character of each GN 
segment. 

3.11.4.20.5. The HFP will access the Hispanic Character Data Store 
(HCD) and will find all occurrences of the character in the 
CHAR_VAR list. 

3.11.4.20.6. The HFP will assign the GN_INIT Key(s) to each GN 
initial. 

3.11.4.20.6.1. The GN_INIT Key will be the SET_ID(s) 
associated with the GN initial (CHAR_VAR). 

3.11.4.20.6.2. The GN segment may have multiple GN_INIT 
Keys. 

3.11.4.20.6.3. The GN_INIT Key will permit retrieval of 
multiple initials for a GN initial. 

3.11.4.20.6.4. The HFP will store the GN_INIT Key(s) for each 
GN segment initial with the record. 

3.11.4.20.7. Query 

3.11.4.20.8. The HFP will access the HCD and find the initial in the 
CHAR list. 

3.11.4.20.9. The HFP will identify the GN_INIT Key for each GN 
segment initial. 

3.11.4.20.10. The GN_INIT Key will be the SET_ID associated with 
the CHAR. 



Figure 10: Example: HFGN_KEYs and GN_INIT Keys (Query) 

INPUT GN   HF?   HFGN_KEY   GN_INIT KEYS 
MARIO      T     020        078 (M) 
MICHAEL    F                078 (M) 
YSABEL     F                036 (Y, I) 
ZUSANA     F                002 (Z, S) 


INPUT GN   HF?   HF VARIANT?   HFGN_TYPE   HFGN_KEY   GN_INIT KEYS 
MARIO      T                   MARIO       020        078 (M) 
MICHAEL    F     F                                    078 (M) 
YSABEL     F     T             ISABEL      203        036 (Y, I) 
ZUSANA     F     T             SUSANA      436        002 (Z, S) 

3.11.4.21. The HFP will direct queries with all HF SN or mixed HF and LF 
SN (including SN_INIT Keys) and any GN Keys (HFGN_KEYs or 
GN_INIT Keys) or FNU to the Hispanic Decision Matrix (HDM) to 
determine the record retrieval criteria. 

3.11.4.21.1. Criteria for database retrieval include name content 
(whether the names are the same or different), the position of 
the name segments, the YOB range, the Refusal Code Level, 
Record Gender and additional restrictions based on the GN. 

3.11.4.22. Hispanic Decision Matrix 

3.11.4.23. The HFP will access the portion of the HDM that represents the 
number of HF SN segments in the query name, either one HF SN or two 
HF SNs. 

3.11.4.24. The HFP will identify and generate the set of SN formats possible 
for the number of SN segments in the query (one or two). 

3.11.4.24.1. The SN formats indicate 

• position of segments, 

• number of segments, and 

• other segments permitted. 

3.11.4.25. The HFP will identify the retrieval criteria in the HDM associated 
with each SN format. 

3.11.4.25.1. The retrieval criteria include 

• Year-of-Birth Range, 

• Refusal Level, and 

• Record Gender. 

3.11.4.26. GN Keys will be carried forward with the retrieval criteria. 

3.11.4.27. The HFP will send to the Hispanic Search Engine (HSE) the query 
format(s), all retrieval criteria associated with each query format, and all 
SN Keys and all GN Keys generated for the query. 



Figure 12: Example: Hispanic Decision Matrix (Values for example only) 

                      Single-Segment SN    Two-Segment SN 
QUERY SN FORMAT       A     A     A        AB    AB    AB    AB    AB    AB    AB    AB 
DATABASE SN FORMATS   A     AB    BA       AB    BA    A     B     AC    CA    CB    BC 
YR                    5     5     2        5     4     4     2     2     0     0     0 
RL                    4     4     3        4     4     4     1     1     0     0     0 
RGNDR                 MFU   MFU   MFU      MFU   MFU   MFU   MFU   FU    MFU   MFU   MFU 
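
One way to picture the lookup described in 3.11.4.23 through 3.11.4.25 is a nested mapping keyed first by the query SN format and then by the database SN format. The sketch below is hypothetical: the names Criteria, HDM and retrieval_criteria are not from the specification, the YR and RL values are treated as opaque integers, and the entries are the example-only values of Figure 12.

```python
from typing import NamedTuple


class Criteria(NamedTuple):
    yob_range: int       # YR column
    refusal_level: int   # RL column
    record_gender: str   # RGNDR column


# Query SN format -> database SN format -> retrieval criteria (Figure 12 values).
HDM = {
    "A": {
        "A": Criteria(5, 4, "MFU"),
        "AB": Criteria(5, 4, "MFU"),
        "BA": Criteria(2, 3, "MFU"),
    },
    "AB": {
        "AB": Criteria(5, 4, "MFU"),
        "BA": Criteria(4, 4, "MFU"),
        "A": Criteria(4, 4, "MFU"),
        "B": Criteria(2, 1, "MFU"),
        "AC": Criteria(2, 1, "FU"),
        "CA": Criteria(0, 0, "MFU"),
        "CB": Criteria(0, 0, "MFU"),
        "BC": Criteria(0, 0, "MFU"),
    },
}


def retrieval_criteria(hf_sn_segment_count):
    """Return the database SN formats and criteria for a one- or two-segment query."""
    if hf_sn_segment_count not in (1, 2):
        raise ValueError("the HDM covers queries with one or two SN segments")
    query_format = "A" if hf_sn_segment_count == 1 else "AB"
    return HDM[query_format]


if __name__ == "__main__":
    for db_format, criteria in retrieval_criteria(2).items():
        print(db_format, criteria)
```

Each entry returned would be sent to the Hispanic Search Engine together with the SN Keys and GN Keys generated for the query.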



3.11.5. Subordinates 
None. 

3.12. LOW FREQUENCY PROCESSOR MODULE DECOMPOSITION 

3.12.1. Identification 

This module is known as the Low Frequency Processor (LFP). 

3.12.2. Type 

3.12.2.1. The LFP is a program module that will process names that contain 
SN segments identified by the FPD as Low Frequency SN segments (i.e., 
not found in the HFST Data Store by the FPD). 

3.12.2.2. The LFP will access the 

• High Frequency Surname Variant Data Store (HFSV) and 

• Low Frequency Surname Type Data Store (LFST). 

3.12.3. Purpose 

The LFP will process name segments that are identified as LF SN segments 
by the FPD. The LFP will determine 1) whether or not the LF segment is a 
variant of one or more HF SN and 2) whether or not the LF segment has 
variants among the LF segments listed in the Low Frequency Surname Type 
Data Store (LFST). The result of these two processes will be a list of 
segments to use as exact matches for retrieval. 

3.12.4. Function 

3.12.4.1. General 

3.12.4.2. The LFP will accept from the FPD any record with a SN segment 
that has been tagged as a LF SN. 

3.12.4.2.1. The LFP will process records that contain only LF 
segments and all records that contain mixed HF and LF 
segments. 

3.12.4.2.2. The LFP will process records that contain one LF segment 
and SN_INIT Keys. 

3.12.4.2.2.1. Low frequency processing will be limited to the LF 
segment. 

3.12.4.2.2.2. The SN_INIT Keys will contribute to the building 
of LF retrieval keys. 

3.12.4.2.3. With mixed HF and LF SN, the LFP will process only the 
LF SN segment. (The HF segment will have been assigned a 
HFSN_KEY by the FPD). 

3.12.4.3. All LF segments in records that are sent to the LFP will be analyzed 
for both HF affiliations and LF variants. 

3.12.4.4. The LFP will attempt 

• to relate each LF SN segment to one or more HFSN_TYPEs, 

• to identify other LF SN segments related to the input LF SN 
segment(s), and 

• to append to the HFSV and LFST any LF SN segments that have not 
been previously submitted to the system. 

3.12.4.5. The LFP will direct a query in which all LF SN segment(s) have 
been related to HFSN_TYPEs to the High Frequency Processor (HFP) 
(see Section 3.11) for generation of GN Keys and submission to the 
Hispanic Decision Matrix. 

3.12.4.5.1. The record may have the format HFSN_KEY (or SN_INIT 
Key) + HFSN_VAR Key, where the second key relates a LF 
SN segment to a HFSN_TYPE. 

3.12.4.5.2. The record may have the format HFSN_VAR Key + 
HFSN_VAR Key, where both segments are keys relating a LF 
SN segment to a HFSN_TYPE. 

3.12.4.5.3. The record may have the format HFSN_VAR Key, where 
the only segment is a key relating a LF SN segment to a 
HFSN_TYPE. 

3.12.4.6. The LFP will direct a query record in which all LF SN segments 
have been related to other LFSN_TYPEs directly to the Hispanic Search 
Engine. (SN_INIT Keys may be present.) 

3.12.4.7. The LFP will perform the following processes: 

• Access the HFSV Data Store to determine if the LF name segment is a 
variant of a HFSN_TYPE. 

• Assign HFSN_VAR Key(s), as appropriate. 

• Generate LF_KEYs for LF SN variants identified in the LFST Data 
Store. 

• Perform a digraph comparison on the HFST Data Store to determine 
if a LF SN not in the HFSV Data Store is a digraph variant of a 
HFSN_TYPE segment. 

3.12.4.8. The goal of the LFP, for a query with LF SN segments, is to develop 
a set of specific names related to the LF SN that will be used as keys for 
record retrieval. 

3.12.4.9. Identifying Related High Frequency Surnames 

3.12.4.10. The LFP will access the HFSV Data Store to determine if each LF 
SN segment in the input name is a variant of a HFSN_TYPE. 

3.12.4.10.1. The LFP will attempt to find all occurrences of the LF SN 
segment in the HFSN_VAR list. 

3.12.4.10.2. Record Add 

3.12.4.10.3. If the segment is found in the HFSN_VAR list, the LFP 
will assign one or more HFSN_KEYs and HFSN_VAR Keys 
to the LF SN. 

3.12.4.10.3.1. The keys will be 

• the HFSN_KEY: the SET_ID associated with the 
HFSN_TYPE that is the parent of the 
HFSN_VAR and 

• the HFSN_VAR Key: the ID_NO of the 
HFSN_VAR. 

3.12.4.10.3.2. The digraph value associated with the 
HFSN_TYPE and HFSN_VAR pair will be retrieved 
and stored with the HFSN_KEY and HFSN_VAR Key 
as the DI_VAL. 

3.12.4.10.3.3. The LFP will store the HFSN_KEY, HFSN_VAR 
Key and the associated digraph value with the record 
segment. 

3.12.4.10.3.3.1. For example, if GARCA is the LF SN 
and is a variant of the HFSN_TYPE GARCIA, 
then GARCA will be given the SET_ID 
associated with the HFSN_TYPE GARCIA 
(0001) and the ID_NO that uniquely identifies 
GARCA (000137). 

3.12.4.10.3.3.2. The associated digraph value (0.77) will 
be stored with the LF SN GARCA as the 
DI_VAL of 0001 and 000137. 

3.12.4.10.3.4. There may be multiple HFSN_KEYs and 
HFSN_VAR Keys associated with a single LF SN 
segment. 



QUERY SURNAME: PEREZ BOMEZ 

                               HF KEYS 
          HF SN?   HFSN_TYPE   HFSN_KEY   HFSN_VAR KEY   DI_VAL 
PEREZ     T        PEREZ       0007                      1.00 
BOMEZ     F        GOMEZ       0010       016978         0.67 



Figure 14: Piece of HFST Data Store 






ID_NO   HFSN_TYPE   SET_ID 
0001    GARCIA      0001 
0002    RODRIGUEZ   0002 
0003    HERNANDEZ   0003 
0004    LOPEZ       0004 
0005    MARTINEZ    0005 
0006    GONZALEZ    0006 
0007    PEREZ       0007 
0008    SANCHEZ     0008 
0009    RAMIREZ     0009 
0010    GOMEZ       0010 
0011                0011 




Figure 15: Piece of HFSV Data Store 

ID_NO    HFSN_VAR   SET_ID   DI_VAL 
032711   PEREZ      007      1.00 
032712   PERES      007      0.67 
032713   PEREZA     007      0.77 
016976   GOMEZ      010      1.00 
016977   GOMES      010      0.67 
016978   BOMEZ      010      0.67 



3.12.4.10.4. Query 

3.12.4.10.5. The LFP will attempt to associate the LF SN with one or 
more HFSN_TYPEs. 

3.12.4.10.5.1. The LFP will access the HFSV and determine if 
the LF SN is a variant of a HFSN_TYPE. 

3.12.4.10.5.2. If the LF SN is found in the HFSN_VAR list of 
the HFSV table, the LFP will assign a HFSN_VAR Key 
to the LF SN segment. 

3.12.4.10.5.2.1. The HFSN_VAR Key will be the 
ID_NO associated with the HFSN_VAR (and not 
the SET_ID that is associated with the 
HFSN_TYPE). 

3.12.4.10.5.2.1.1. The LF segment will be 
associated with the HF segment 
but with no other name segments 
in the same HFSN_TYPE class. 

3.12.4.10.5.2.1.2. That is, the variants 
associated with the HF segment 
are not related to one another 
through this process. 

3.12.4.10.5.3. A LF SN may be a variant of multiple 
HFSN_TYPEs and may therefore receive multiple 
HFSN_VAR Keys. 

3.12.4.10.6. If, by virtue of this process, all LF SN segments in a 
query are set equal to HFSN_VAR Keys, the LFP will direct 
the query record to the HFP (see Section 3.11) for generation 
of Given Name Keys, submission to the High Frequency 
Decision Matrix (HDM) and identification of retrieval criteria. 



Figure 16: Example: Associating a LF SN Segment with HFSN_TYPE in a Query 

QUERY SURNAME: PEREZ BOMEZ 

                               HF KEYS 
          HF SN?   HFSN_TYPE   HFSN_KEY   HFSN_VAR KEY 
PEREZ     T        PEREZ       0007 
BOMEZ     F        GOMEZ                  016978 



Figure 17: Piece of HFST Data Store 



ID_NO   HFSN_TYPE   SET_ID 
0001    GARCIA      0001 
0002    RODRIGUEZ   0002 
0003    HERNANDEZ   0003 
0004    LOPEZ       0004 
0005    MARTINEZ    0005 
0006    GONZALEZ    0006 
0007    PEREZ       0007 
0008    SANCHEZ     0008 
0009    RAMIREZ     0009 
0010    GOMEZ       0010 
0011                0011 




Figure 18: Piece of HFSV Data Store 






ID_NO    HFSN_VAR   SET_ID   DI_VALUE 
032711   PEREZ      007      1.00 
032712   PERES      007      0.67 
032713   PEREZA     007      0.77 
016976   GOMEZ      010      1.00 
016977   GOMES      010      0.67 
016978   BOMEZ      010      0.67 



3.12.4.10.7. The LFP will direct all queries and record adds to a LF 
analysis whether or not the LF SN segment was identified as a 
variant of a HFSN_TYPE. 
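
The difference between the keys assigned on a record add (3.12.4.10.3) and on a query (3.12.4.10.5) can be sketched as two lookups over the same HFSV rows. This is a hedged illustration only: the list HFSV reproduces the six sample rows from the HFSV figures, with SET_ID written as it appears there, and the function names record_add_keys and query_keys are hypothetical.

```python
# (ID_NO, HFSN_VAR, SET_ID, DI_VAL) rows copied from the HFSV excerpt.
HFSV = [
    ("032711", "PEREZ", "007", 1.00),
    ("032712", "PERES", "007", 0.67),
    ("032713", "PEREZA", "007", 0.77),
    ("016976", "GOMEZ", "010", 1.00),
    ("016977", "GOMES", "010", 0.67),
    ("016978", "BOMEZ", "010", 0.67),
]


def record_add_keys(lf_sn):
    """Record add: return (HFSN_KEY, HFSN_VAR Key, DI_VAL) for every match.

    The HFSN_KEY is the parent SET_ID and the HFSN_VAR Key is the ID_NO.
    """
    return [(set_id, id_no, di_val)
            for id_no, variant, set_id, di_val in HFSV
            if variant == lf_sn]


def query_keys(lf_sn):
    """Query: return only the HFSN_VAR Key(s), that is, the matching ID_NO(s)."""
    return [id_no for id_no, variant, _, _ in HFSV if variant == lf_sn]


if __name__ == "__main__":
    print(record_add_keys("BOMEZ"))   # [('010', '016978', 0.67)]
    print(query_keys("BOMEZ"))        # ['016978']
```

Because the query path carries the ID_NO rather than the SET_ID, BOMEZ is tied to GOMEZ without also being tied to the other variants in the GOMEZ set, as 3.12.4.10.5.2.1.1 requires.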

3.12.4.11. Identifying Related Low Frequency Surnames 

3.12.4.12. General 

3.12.4.13. All records with one or more LF SN segments will undergo LF 
analysis by the LFP. 

3.12.4.14. For record adds, the LFP will assign an ID_NO that will be stored 
with the record. 

3.12.4.15. The LFP will generate LFDIKEYs for each LF SN segment in the 
query. 

3.12.4.16. The LFP will use the LFDIKEYs to identify related LF SN 
segments. 

3.12.4.17. Note that the LFP will not generate LF Keys for the GN portion of 
the input name. 

3.12.4.18. LF SN Segment In LFST 

3.12.4.19. The LFP will determine if the LF SN segment of the input record is 
in the Low Frequency Surname Type Data Store (LFST). 

3.12.4.20. Record Add: 

3.12.4.21. If the LF SN segment is a LFSN_TYPE in the LFST, the LFP will 
assign to and store with the LF SN segment the ID_NO associated with 
the LFSN_TYPE. 

3.12.4.22. Query: 

3.12.4.23. If the LF SN segment is a LFSN_TYPE in the LFST, the LFP will 
retrieve the (up to) 10 digraph keys (LFDIKEYs) that are associated with 
the name segment in the LFST. 

3.12.4.24. The LFP will use the LFDIKEYs retrieved from the LFST and the 
LFDIKEYs stored with all LFSN_TYPEs in the LFST Data Store to subset the 
LFST and to identify potential variants of the input LF SN segment. 

3.12.4.25. Identifying LF Query Variants 

3.12.4.26. Phase 1: 

3.12.4.27. The LFP will subset the LFST Data Store. 

3.12.4.28. The LFP will select those names from the LFST that share a 
predetermined set of LFDIKEYs. 

3.12.4.28.1. The LFP will determine the number of LFDIKEYs shared 
between each LFSN_TYPE and the LF query SN segment. 

3.12.4.28.2. The LFP will determine the Shared Key Value based on 
the number of shared digraphs. 

3.12.4.28.2.1. The LFP will use the following formula to 
determine the Shared Key Value: multiply the number 
of shared keys by two and divide by the combined total 
number of keys associated with the two names: 

2 * [number of shared keys] / (total keys of 
Comparand #1 plus total keys of Comparand #2) 

3.12.4.28.3. The LFP will select only those LFSN_TYPEs whose 
Shared Key Value passes the LFDIKEY Threshold. 

3.12.4.28.3.1. The LFP will access the Hispanic Parameter Data 
Store to identify the minimum matching requirement for 
the Shared Key Value, the LFDIKEY Threshold. 

3.12.4.28.3.2. For a segment to qualify for further processing, 
the Shared Key Value must pass the LFDIKEY 
Threshold found in the Hispanic Parameter Data Store 
(HPD). 
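
The Phase 1 selection in 3.12.4.28 can be sketched as follows. The key lists are the LFDIKEYs given for TOREAT, THORET and TOERO in Figure 19; the function names, the use of a plain set intersection to count shared keys, and the reading of "passes" as greater than or equal to the threshold are assumptions made for illustration (the last one is suggested by the THORET row, where 0.40 passes a 0.40 threshold).

```python
# LFDIKEYs as listed in Figure 19 (query TOREAT and two LFSN_TYPE candidates).
TOREAT = ["TO1", "TO2", "OR2", "OR1", "OR3", "RE3", "RE2", "RE4", "EA4", "EA3"]
THORET = ["TH1", "TH2", "HO2", "HO1", "HO3", "OR3", "OR2", "OR4", "RE4", "RE3"]
TOERO = ["TO1", "TO2", "OE2", "OE1", "OE3", "ER3", "ER2", "ER4", "RO4", "RO3"]


def shared_key_value(keys_1, keys_2):
    """2 * shared keys / (total keys of Comparand #1 + total keys of Comparand #2)."""
    shared = len(set(keys_1) & set(keys_2))
    return 2 * shared / (len(keys_1) + len(keys_2))


def passes_threshold(value, threshold):
    """Assumed reading: a value equal to the threshold passes."""
    return value >= threshold


if __name__ == "__main__":
    lfdikey_threshold = 0.40  # example value from Figure 19
    for name, keys in (("THORET", THORET), ("TOERO", TOERO)):
        value = shared_key_value(TOREAT, keys)
        print(name, value, passes_threshold(value, lfdikey_threshold))
```

THORET shares four keys with TOREAT (0.40, passes) and TOERO shares two (0.20, fails), matching Figure 19. Applied to FLORENZAN and LORENZ, a plain set intersection also counts LO1 and gives six shared keys rather than the five listed in Figure 19, so the counting rule behind that row may differ from this sketch.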



Figure 19: Example: Phase 1: LF Variants Related to a LF Query SN Segment; LFDIKEY 
Threshold = 0.40 

                ID_NO    NAME        LFDIKEYs                        SHARED KEYS        PASS LFDIKEY 
                                                                                        THRESHOLD 0.40? 
QUERY NAME #1            FLORENZAN   FL1/FL2/LO2/LO1/LO3/ 
                                     OR3/OR2/OR4/RE4/RE3 
LFSN_TYPE       000189   FLORENZAN   FL1/FL2/LO2/LO1/LO3/            10 (All)           2*10/20 = 1.00 
                                     OR3/OR2/OR4/RE4/RE3                                YES 
LFSN_TYPE       000232   FLORESZ     FL1/FL2/LO2/LO1/LO3/            10 (All)           2*10/20 = 1.00 
                                     OR3/OR2/OR4/RE4/RE3                                YES 
LFSN_TYPE       000412   LORENZ      LO1/LO2/OR2/OR1/OR3/            5 (LO2/OR2/OR3     2*5/20 = 0.50 
                                     RE3/RE2/RE4/EN4/EN3             /RE3/RE4)          YES 
QUERY NAME #2            TOREAT      TO1/TO2/OR2/OR1/OR3/ 
                                     RE3/RE2/RE4/EA4/EA3 
LFSN_TYPE       000714   TOREAT      TO1/TO2/OR2/OR1/OR3/            10 (All)           2*10/20 = 1.00 
                                     RE3/RE2/RE4/EA4/EA3                                YES 
LFSN_TYPE       000652   THORET      TH1/TH2/HO2/HO1/HO3/            4 (OR2/OR3/RE3     2*4/20 = 0.40 
                                     OR3/OR2/OR4/RE4/RE3             /RE4)              YES 
LFSN_TYPE       000776   TOERO       TO1/TO2/OE2/OE1/OE3/            2 (TO1/TO2)        2*2/20 = 0.20 
                                     ER3/ER2/ER4/RO4/RO3                                NO 



3.12.4.29. Phase 2: 

3.12.4.30. The LFP will perform a digraph comparison of each LF query SN 
segment that passed the LFDIKEY Threshold with each LFSN_TYPE. 

3.12.4.31. The digraph comparison will identify the set of names to be 
retrieved from the database. 

3.12.4.31.1. See Section 3.11.4.19.4 for the digraph analysis function 
and formula. 

3.12.4.31.2. The LFP will access the Hispanic Parameter Data Store to 
determine the LF_DI Threshold. 

3.12.4.31.3. For a segment to qualify for further processing, the 
digraph value must pass the LF_DI Threshold found in the 
Hispanic Parameter Data Store. 



Figure 20: Example: Digraph Filter of LFST Candidate SN Segments 

LF QUERY SN: FLORENZAN 

LF QUERY SN   LFSN_TYPES PASSING    DIGRAPH SCORE     PASS LF_DI 
              LFDIKEY THRESHOLD                       THRESHOLD: 0.57? 
FLORENZAN     FLORENZAN             2*10/20 = 1.00    YES 
FLORENZAN     FLORESZ               2*5/18 = 0.56     NO 
FLORENZAN     LORENZ                2*5/17 = 0.59     YES 



3.12.4.32. The LFP will assign a key (DI_KEY) to each LFSN_TYPE that 
passes the LF_DI Threshold. 

3.12.4.32.1. The DI_KEY will be the ID_NO associated with the 
LFSN_TYPE in the LFST. 

3.12.4.32.2. The DI_KEY will contribute to the building of the 
retrieval key. 

3.12.4.33. For a segment that passes the LF_DI Threshold, the LFP will retain 
the digraph score derived from the digraph evaluation and associate it 
with the appropriate DI_KEY. 

3.12.4.34. Low Frequency Surname Segment Not in LFST 

3.12.4.35. Add: 

3.12.4.36. If the LF SN record add segment is not in the LFST, the LFP 

• will append the LF SN to the LFST as a LFSN_TYPE and assign 
the next ID_NO available; 

• will generate the LFDIKEYs for the new LFSN_TYPE and will 
add them to the LFST with the LFSN_TYPE; 

• will assign the ID_NO to the LF SN segment of the added record; and 

• will determine if the LF SN is a variant of a HFSN_TYPE and 
therefore should also be added to the HFSV Data Store. 

3.12.4.37. The LFP will append the LF SN segment and its LFDIKEYs to the 
LFST. 

3.12.4.37.1. The LFP will assign the next available ID_NO to the 
newly entered LF SN (LFSN_TYPE). 

3.12.4.37.2. The LFP will generate the LFDIKEYs to be associated 
with the LF SN segment (see 3.13.4.42). 

3.12.4.37.3. The up-to-10 keys will be added to the LFST along with 
the LFSN_TYPE. 

3.12.4.37.4. The LFP will assign the LFSN_TYPE ID_NO to the LF 
SN segment for storage with the record add. 

3.12.4.38. The LFP will determine if a LF SN segment that was not identified 
as a HFSN_VAR in the HFSV and that was not identified as a 
LFSN_TYPE in the LFST is a potential variant of a HFSN_TYPE. 

3.12.4.38.1. The LFP will access the High Frequency Surname Type 
Data Store (HFST) to determine if the LF SN segment is a 
digraph variant of one or more of the HFSN_TYPEs. 

3.12.4.38.1.1. The LFP will perform a digraph evaluation of the 
LF SN and each HFSN_TYPE. (See Section 3.11.4.19.4 
for details of the procedure and formula for performing a 
digraph evaluation.) 

3.12.4.38.1.2. To qualify for addition to the HFSV as a variant 
of one or more HFSN_TYPEs, the digraph value must 
pass a threshold, the High Frequency Surname Variant 
Threshold (HFSV Threshold). 

3.12.4.38.1.3. The LFP will access the Hispanic Parameter Data 
Store to determine the HFSV Threshold that the digraph 
value must pass for the LF SN to be appended to the 
HFSV Data Store. 

3.12.4.38.2. If the LF SN segment is determined to be a digraph 
variant of one or more HFSN_TYPEs, the LFP 

• will append the LF SN to the HFSN_TYPEs to which it is 
related by entering the name into the HFSN_VAR list; 

• will assign an ID_NO to the newly added HFSN_VAR; 

• will assign to the newly added HFSN_VAR the SET_ID 
that corresponds to the SET_ID of the HFSN_TYPE with 
which the new HFSN_VAR is associated; 

• will enter the digraph value into DI_VAL; and 

• will store with the LF SN segment in the record add the 
ID_NO of the HFSN_VAR for each entry, the SET_ID of 
each HFSN_TYPE that is the parent of the HFSN_VAR 
and the DI_VAL for each relationship. 

3.12.4.39. Query 

3.12.4.40. If the LF SN query segment is not in the LFST, the LFP 

• will generate the LFDIKEYs for the new LF SN; 

• will select the related LFSN_TYPEs through the LF selection 
process; and 

• will determine if the LF SN is a variant of a HFSN_TYPE, assign 
appropriate keys and retain related digraph values. (See Section 
3.12.4.43.) 

3.12.4.41. The LFP will generate the LFDIKEYs for the LF SN segment (see 
Section 3.12.4.44). 

3.12.4.42. The LFP will identify LFSN_TYPEs in the LFST that are variants 
of the LF SN. (See Section 3.12.4.11 for the identification process.) 

3.12.4.43. The LFP will determine if a LF SN segment that was not identified 
as a HFSN_VAR and that was not found in the LFST is a potential 
variant of a HFSN_TYPE. 

3.12.4.43.1. The LFP will access the HFST Data Store and perform a 
digraph evaluation between the LF SN and each 
HFSN_TYPE. (See Section 3.11.4.19.4 for details of the 
procedure and formula for performing a digraph evaluation.) 

3.12.4.43.2. The digraph value must pass a threshold for the LF SN to 
be considered a variant of a HFSN_TYPE(s), the High 
Frequency Surname Variant Threshold (HFSV Threshold). 

3.12.4.43.3. The LFP will access the Hispanic Parameter Data Store to 
determine the HFSV Threshold that the digraph value must 
pass for the LF SN to qualify as a variant of a HFSN_TYPE. 

3.12.4.43.4. If the LF SN segment passes the HFSV Threshold, the 
LFP will assign HFSN_VAR Key(s) to the LF SN segment. 

3.12.4.43.4.1. The HFSN_VAR Key will be the ID_NO 
associated with the HFSN_VAR that is equal to the 
HFSN_TYPE. 

3.12.4.43.4.1.1. That is, the LF SN segment will be 
associated with the parent HFSN_TYPE only. 

3.12.4.43.4.1.2. A LF SN may be a variant of multiple 
HFSN_TYPEs and may therefore receive 
multiple HFSN_VAR Keys. 

3.12.4.43.4.2. The calculated digraph value will be associated 
with each HFSN_VAR Key. 

3.12.4.43.5. If, by virtue of this process, all LF SN segments in a 
query are set equal to HFSN_VAR Keys, the LFP will direct 
the query record to the HFP (Section 3.11) for generation of 
Given Name Keys, submission to the High Frequency 
Decision Matrix (HDM) and identification of retrieval criteria. 

3.12.4.44. Generating LFDIKEYs 

3.12.4.44.1. The LFDIKEY is 

1) a set of digraphs formed from the LF SN segment 

beginning with the leftmost character and 

2) a set of positional variants on those digraphs. 

3.12.4.44.2. Positional information will be associated with each 
digraph. 

3.12.4.44.3. Base Keys 

3.12.4.44.4. The LFP will begin with the leftmost character and 
generate up to four digraph keys (Base Keys) from the (up to) 
five leftmost characters of the LF SN segment. 

3.12.4.44.4.1. The first two characters form a digraph, the 
second and third characters form a digraph, the third and 
fourth characters form a digraph and the fourth and fifth 
characters form a digraph. 

3.12.4.44.4.2. Positional information will be included: 1, 2, 3, 4,
respectively: DI1, DI2, DI3, DI4.

3.12.4.44.5. If the LF SN segment has fewer than five characters, the
LFP will generate fewer than four Base Keys: one fewer than the
number of characters in the LF SN segment.

3.12.4.44.6. Positional information will be included. 

3.12.4.44.7. Position Keys 

3.12.4.44.8. The LFP will generate up to six additional Position Keys
from the Base Keys.

3.12.4.44.8.1. A maximum of ten keys (Base + Position) will be
generated.

3.12.4.44.8.2. The Position Keys have the same characters as the 
Base Keys but contain different positional information. 

3.12.4.44.8.3. For segments with 5 or more characters: 

3.12.4.44.8.4. The LFP will produce a Position Key on the first 
Base Key with Position 2. 

3.12.4.44.8.5. The LFP will produce Position Keys on the 
second Base Key with Position 1 and Position 3. 

3.12.4.44.8.6. The LFP will produce Position Keys on the third
Base Key with Position 2 and Position 4. 

3.12.4.44.8.7. The LFP will produce a Position Key on the 
fourth Base Key with Position 3. No Position Key is 
generated for Position 5 because the maximum of 10 
keys has been reached. 

3.12.4.44.8.8. For segments with fewer than 5 characters: 

3.12.4.44.8.9. The LFP will produce Position Keys in the same 
way as for longer LF SN segments. 

3.12.4.44.8.9.1. No Position Key will be generated for
the final Base Key with a position to the right of
the final character.

3.12.4.44.8.9.2. The total number of LFDIKEYs will be
fewer than with a longer LF SN segment.

3.12.4.44.8.9.3. In GOMA, the LFP will generate a total
of 7 keys: the Base Keys GO1, OM2 and MA3,
and the Position Keys GO2, OM1, OM3 and
MA2. (Note: No MA4 Position Key is produced
for the final digraph.)



Figure 21: Example: LFDIKEYs for LF SN Segments

LF SN SEGMENT    LFDIKEYS: BASE KEYS      LFDIKEYS: POSITION KEYS
CARRIOS          CA1 / AR2 / RR3 / RI4    CA2 / AR1 / AR3 / RR2 / RR4 / RI3
BALA             BA1 / AL2 / LA3          BA2 / AL1 / AL3 / LA2
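
The Base Key and Position Key rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the LFP code: the function name and the string rendering of a key (digraph followed by its position) are hypothetical, and the rule that a shifted position must stay between 1 and the number of Base Keys is inferred from the GOMA, CARRIOS and BALA examples.

```python
def lfdikeys(segment):
    """Return (base_keys, position_keys) for one LF SN segment."""
    head = segment[:5]  # only the (up to) five leftmost characters are used

    # Base Keys: consecutive digraphs tagged with their 1-based position.
    base = [(head[i:i + 2], i + 1) for i in range(len(head) - 1)]
    last = len(base)

    # Position Keys: the same digraphs shifted one position left and right.
    position = []
    for digraph, pos in base:
        for shifted in (pos - 1, pos + 1):
            # Assumption: shifted positions stay within 1..last, which gives
            # "no MA4" for GOMA and "no Position 5" for CARRIOS.
            if 1 <= shifted <= last:
                position.append((digraph, shifted))

    def render(key):
        return f"{key[0]}{key[1]}"

    return [render(k) for k in base], [render(k) for k in position]


for name in ("CARRIOS", "BALA", "GOMA"):
    print(name, *lfdikeys(name))
```

With this rule a segment of five or more characters yields four Base Keys and six Position Keys, which is the stated maximum of ten.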



3.12.4.44.9. Building LF Retrieval Keys (Query) 

3.12.4.44.10. General 

3.12.4.44.11. Each LF SN segment has been assigned a DI_KEY or
set of DI_KEYs.

3.12.4.44.12. A LF SN segment may also have been assigned one or
more HFSN_VAR Keys.

3.12.4.44.13. The LFP has sent queries with all HFSN_KEYs and/or
HFSN_VAR Keys (including SN_INIT Keys) to the HFP for
further processing.

3.12.4.44.14. The LFP will build sets of retrieval keys for mixed
frequency queries (at least one HF key and one LF SN key in
the string) and for queries with all low frequency keys (all SN
in the string must be DI_KEYs).

3.12.4.44.14.1. A single query may have various formats - all 
HF, mixed and/or all-LF SN depending on the results of 
LF processing prior to this stage. 

3.12.4.44.15. Mixed frequency queries will not contain SN_INIT
Keys.

3.12.4.44.15.1. SN_INIT Keys may occur with HF SN keys, in
which case the record will be treated as an all-HF SN
record.

3.12.4.44.15.2. SN_INIT Keys may occur with LF SN keys, in
which case the record will be treated as an all-LF SN
record.

3.12.4.44.16. Queries with Mixed Frequency (HF + LF) Surnames



3.12.4.44.17. Type 1: 

3.12.4.44.18. If one SN in the query is a HF SN and has an associated
HFSN_KEY and one SN in the query record is a LF SN
segment and has associated DI_KEYs, the LFP will build a
Mixed Key of the HFSN_KEY and each DI_KEY (and the
associated DI_VALs).

3.12.4.44.18.1. The HFSN_KEY represents a set of variants of
one HFSN_TYPE.

3.12.4.44.18.1.1. GARCIA, GARCA, GARZA are all
digraph variants of the HFSN_TYPE GARCIA,
which has the SET_ID 0001.

3.12.4.44.18.1.2. Record adds and queries will already
have been assigned the HFSN_KEY through the
HFP.

3.12.4.44.18.2. The DI_KEY represents a single low frequency
surname type that has qualified through the LF SN
selection process.

Figure 22: Example: Building Mixed HF/LF SN Retrieval Keys with HFSN_KEY and
DI_KEYs

QUERY NAME: GARCIA FLORENZAN    HFSN_KEY and (LF) DI_KEY
GARCIA (HF)                     GARCIA -> 001
FLORENZAN (LF)                  FLORENZAN -> 000189

GARCIA (HF)                     GARCIA -> 001
LORENZ (LF)                     LORENZ -> 000412



3.12.4.44.19. Type 2: 

3.12.4.44.20. If one SN in the query record is a HF SN and has an
associated HFSN_VAR Key (generated by the LFP) and one
SN in the query record is a LF SN segment with associated
DI_KEY(s), the LFP will build a Mixed Key of the
HFSN_VAR Key and the DI_KEY for each qualifying
LFSN_TYPE.

3.12.4.44.20.1. The HFSN_VAR represents a single
HFSN_TYPE and not a set of variants.

3.12.4.44.20.1.1. Record adds and queries will already
have been assigned the HFSN_VAR Key through
the LFP.

3.12.4.44.20.2. The DI_KEY represents a single low frequency
surname type that has qualified through the LF SN
selection process.

Figure 23: Example: Building Mixed HF/LF SN Retrieval Keys with HFSN_VAR Keys
and DI_KEYs

QUERY NAME: GARCIA FLORENZAN    HFSN_VAR and (LF) DI_KEY
BOMEZ (LF -> HFSN_VAR)          BOMEZ -> 016978
FLORENZAN (LF)                  FLORENZAN -> 000189

BOMEZ (LF -> HFSN_VAR)          BOMEZ -> 016978
LORENZ (LF)                     LORENZ -> 000412







3.12.4.44.21. It is likely that there will be multiple DI_KEYs for each
LF SN segment, resulting in multiple Mixed Keys.

3.12.4.44.22. Once the Mixed SN Keys have been generated (and
DI_VALs associated with the appropriate keys), the LFP will
send any query that contains mixed HF and LF Keys to the
HFP (Section 3.11.4.5) for Given Name processing and
identification of retrieval criteria from the Hispanic Decision
Matrix.
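
Building Mixed Keys amounts to pairing the one HF key with every DI_KEY that qualified for the LF SN segment. The sketch below illustrates that pairing only; the `MixedKey` type and `build_mixed_keys` function are hypothetical names, and the sample keys and digraph values are taken from Figures 22 and 24.

```python
from typing import NamedTuple


class MixedKey(NamedTuple):
    hf_key: str    # HFSN_KEY (Type 1) or HFSN_VAR Key (Type 2)
    di_key: str    # LFST ID_NO of one qualifying LFSN_TYPE
    di_val: float  # digraph value associated with that DI_KEY


def build_mixed_keys(hf_key, di_keys):
    """Pair one HF key with each (DI_KEY, DI_VAL) of the LF SN segment."""
    return [MixedKey(hf_key, di_key, di_val) for di_key, di_val in di_keys]


# GARCIA (001) with the DI_KEYs that qualified for FLORENZAN.
mixed = build_mixed_keys("001", [("000189", 1.00), ("000412", 0.59)])
for key in mixed:
    print(key)
```

Each resulting Mixed Key is a separate query format, which is why one query name can fan out into several retrievals.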

3.12.4.44.23. Queries with All Low Frequency (LF + LF) 
Surnames 

3.12.4.44.24. The LFP will identify the LF Keys associated with query
formats made up solely of LF SN segments (or a LF SN
segment and SN_INIT Key(s)).

3.12.4.44.24.1. The LFP has qualified one or more LF SN
segments from the LFST as variants of each LF query
SN.

3.12.4.44.24.2. Each qualifying LF segment has been assigned a
LF Key, the DI_KEY, and has an associated digraph
value, DI_VAL.



Figure 24: Example: Low Frequency DI_KEYs and Associated Digraph Values

QUERY NAME: TOREAT FLORENZAN    LFST ID_NO             DI_KEYs + DI_VAL
TOREAT                          TOREAT -> 000714       000714 (1.00)
THORET                          THORET -> 000652       000652 (0.57)
FLORENZAN                       FLORENZAN -> 000189    000189 (1.00)
FLORESZ                         FLORESZ -> 000232      000232 (0.56)
LORENZ                          LORENZ -> 000412       000412 (0.59)



3.12.4.44.25. The LFP will direct a query record with all DI_KEYs
and their digraph values (or DI_KEYs and SN_INIT Keys) to
the Hispanic Search Engine for retrieval of database records.

3.12.5. Subordinates

None. 

3.13. HISPANIC SEARCH ENGINE MODULE DECOMPOSITION 

3.13.1. Identification 

This module is known as the Hispanic Search Engine (HSE). 

3.13.2. Type 

3.13.2.1. The HSE is a function that applies to queries only.

3.13.2.2. The HSE will accept name keys and retrieval criteria from the HFP
and the LFP.

3.13.2.3. The module must follow the HFP and LFP.

3.13.3. Purpose 

The HSE will retrieve records from the VLDB based on criteria identified by 
the High Frequency Processor and the Low Frequency Processor. These
criteria will delimit the set of records that can qualify for retrieval. The 
system must be sure that the criteria have all been identified and can be 
associated with database records (whether through database design and/or key 
generation). 

3.13.4. Function 

3.13.5. HNA-E will not handle records with Last Name Unknown (LNU). 

3.13.6. The HSE will permit First Name Unknown (FNU). 

3.13.6.1. The processing of FNU will supersede other GN restrictions. 

3.13.6.2. The HSE will retrieve any database GN when FNU occurs in the
query.

3.13.6.3. The HSE will retrieve any FNU in the database for any query GN.

3.13.7. High Frequency Retrieval 

3.13.7.1. High frequency retrieval will include records with HFSN_KEYs,
HFSN_VAR Keys and SN_INIT Keys that occur with the HF SN keys.

3.13.7.1.1. The SN_INIT Key will result in the retrieval of records
that begin with or are equal to the variant initials identified by
the SN_INIT Key.

3.13.7.1.2. The SN_INIT Key is stored with each SN segment.

3.13.7.1.3. All HF retrieval restrictions apply to the SN_INIT Key, as
if it were a HF segment, except that

3.13.7.1.3.1. If the SN_INIT Key is the only key in the format,
the HSE will not undertake a database search.

3.13.7.1.4. No further, separate detailing of the SN_INIT Key is given.

3.13.7.2. High frequency retrieval will include mixed HF and LF SN Keys but
with no SN_INIT Keys.

3.13.8. All HFSN_KEYs (or SN_INIT Key)

3.13.9. For queries with all HFSN_KEYs, the HSE will retrieve database
records that

• contain the appropriate SN format (position, more/fewer segments, 
different segments) as specified in the HDM, 

• contain the appropriate HFSN_KEYs, 

• meet all the criteria identified in the HDM and 

• meet the GN restrictions. 

3.13.10. The HFSN_KEY will result in retrieval of the HFSN_TYPE and all its
variants.

3.13.10.1. The HSE will further restrict the retrieval to records that match at
least one key of the GN.

3.13.10.1.1. If the query has produced only HFGN_KEYs, only
records that have at least one of the HFGN_KEYs will be
retrieved.

3.13.10.1.2. If the query has produced mixed HFGN_KEYs and
GN_INIT Keys, the HSE will retrieve records that match at
least one of the HFGN_KEYs or one of the GN_INIT Keys.

3.13.10.1.3. If the query has produced only GN_INIT Keys, the HSE
will retrieve records that match at least one of the GN_INIT
Keys.
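
The three GN cases above reduce to one test: a record qualifies if it shares at least one HFGN_KEY or at least one GN_INIT Key with the query, and First Name Unknown (FNU) on either side supersedes the test (Section 3.13.6). The sketch below is a hedged illustration; the function name, the use of sets and the boolean FNU flags are assumptions, not part of the specification.

```python
def gn_qualifies(query_hfgn, query_init, record_hfgn, record_init,
                 query_fnu=False, record_fnu=False):
    """Return True if a database record passes the GN restriction."""
    # FNU supersedes the other GN restrictions on either side.
    if query_fnu or record_fnu:
        return True
    shares_hfgn = set(query_hfgn) & set(record_hfgn)
    shares_init = set(query_init) & set(record_init)
    return bool(shares_hfgn or shares_init)


# Query JOSE CARLOS from Figure 25 produced HFGN_KEYs 0001 and 0007.
print(gn_qualifies({"0001", "0007"}, set(), {"0007"}, set()))
```

When the query produced only HFGN_KEYs, the GN_INIT set is empty and the test collapses to case 3.13.10.1.1; the same holds in reverse for case 3.13.10.1.3.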



Figure 25: Example: Record Matching Criteria: All HFSN_KEYs and HFGN_KEYs

QUERY #1       RODRIGUEZ         LOPEZ          JOSE    CARLOS   CRITERIA
HFSN_KEY       002               010
HFGN_KEY                                        0001    0007
HDM FORMATS:
1              RODRIGUEZ (002)   LOPEZ (010)                     YOB5, RL4, MFU, GN contains 0001 or 0007
2              LOPEZ             RODRIGUEZ                       YOB4, RL4, MFU, GN contains 0001 or 0007
3              RODRIGUEZ                                         YOB4, RL4, MFU, GN contains 0001 or 0007
4              LOPEZ                                             YOB2, RL1, MFU, GN contains 0001 or 0007
5              RODRIGUEZ         * (ANY SN)                      YOB2, RL1, FU, GN contains 0001 or 0007
6              LOPEZ             * (ANY SN)                      YOB0, RL0, MFU, GN contains 0001 or 0007
7              * (ANY SN)        RODRIGUEZ                       YOB0, RL0, MFU, GN contains 0001 or 0007
8              * (ANY SN)        LOPEZ                           YOB0, RL0, MFU, GN contains 0001 or 0007

Figure 26: Example: Record Matching Criteria: All HFSN_KEYs and Mixed
HFGN_KEYs and GN_INIT Keys

QUERY #2         RODRIGUEZ         LOPEZ          JOSSE        CARLOS   CRITERIA
HFSN_KEY         002               010
HFGN_KEY                                                       0007
GN_INIT Key(s)                                    041 (J, H)
HDM FORMATS:
1                RODRIGUEZ (002)   LOPEZ (010)                          YOB5, RL4, MFU, GN initial = J or H; or GN = 0007
2                LOPEZ             RODRIGUEZ                            YOB4, RL4, MFU, GN initial = J or H; or GN = 0007
3                RODRIGUEZ                                              YOB4, RL4, MFU, GN initial = J or H; or GN = 0007
4                LOPEZ                                                  YOB2, RL1, MFU, GN initial = J or H; or GN = 0007
5                RODRIGUEZ         *                                    YOB2, RL1, FU, GN initial = J or H; or GN = 0007
6                LOPEZ             *                                    YOB0, RL0, MFU, GN initial = J or H; or GN = 0007
7                *                 RODRIGUEZ                            YOB0, RL0, MFU, GN initial = J or H; or GN = 0007
8                *                 LOPEZ                                YOB0, RL0, MFU, GN initial = J or H; or GN = 0007



3.13.11. HFSN_KEY and/or HFSN_VAR Keys (or SN_INIT Key)

3.13.12. For queries with mixed HFSN_KEYs and HFSN_VAR Keys and queries
with all HFSN_VAR Keys, the HSE will retrieve database
records that

• contain the appropriate HFSN_KEYs and/or HFSN_VAR Keys,

• contain the appropriate SN format (position, more/fewer segments,
different segments) as specified in the HDM,

• meet all the criteria identified in the HDM and

• meet the GN restrictions.

3.13.13. The HFSN_VAR Key will retrieve a single HFSN_TYPE and not the set
of variants associated with the HFSN_TYPE (e.g., the name LOPEZ but not
all its variants; the variants will be retrieved by the LFP).

3.13.13.1. The HSE will further restrict the retrieval to records that match at
least one key of the GN.

3.13.13.1.1. If the query has produced only HFGN_KEYs, only
records that have at least one of the HFGN_KEYs will be
retrieved.

3.13.13.1.2. If the query has produced mixed HFGN_KEYs and
GN_INIT Keys, the HSE will retrieve records that match at
least one of the HFGN_KEYs or the GN_INIT Keys.

3.13.13.1.3. If the query has produced only GN_INIT Keys, the HSE
will retrieve records that match at least one of the GN_INIT
Keys.



Figure 27: Example: Record Matching Criteria: All HFSN_KEY and/or HFSN_VAR
Keys

QUERY #1         RODRIGUEZ         SLOPEZ           JOSSE        CARLOS   CRITERIA
HFSN_KEY         002
HFSN_VAR Key                       00976
HFGN_KEY                                                         0007
GN_INIT Key(s)                                      041 (J, H)
HDM FORMATS:
1                RODRIGUEZ (002)   LOPEZ (000976)                         YOB5, RL4, MFU, GN initial = J or H; or GN = 0007
2                LOPEZ             RODRIGUEZ                              YOB4, RL4, MFU, GN initial = J or H; or GN = 0007
3                RODRIGUEZ                                                YOB4, RL4, MFU, GN initial = J or H; or GN = 0007
4                LOPEZ                                                    YOB2, RL1, MFU, GN initial = J or H; or GN = 0007
5                RODRIGUEZ         *                                      YOB2, RL1, FU, GN initial = J or H; or GN = 0007
6                LOPEZ             *                                      YOB0, RL0, MFU, GN initial = J or H; or GN = 0007
7                *                 RODRIGUEZ                              YOB0, RL0, MFU, GN initial = J or H; or GN = 0007
8                *                 LOPEZ                                  YOB0, RL0, MFU, GN initial = J or H; or GN = 0007

3.13.14. Mixed HFSN_KEY and/or HFSN_VAR Keys and LF DI_KEYs (no
SN_INIT Key)

3.13.15. For queries with mixed HFSN_KEYs/HFSN_VAR Keys and LF
DI_KEYs, the HSE will retrieve database records that

• contain the appropriate HFSN_KEYs/HFSN_VAR Keys and DI_KEYs,

• contain the appropriate SN format (position, more/fewer segments,
different segments) as retrieved from the HDM,

• meet all the criteria identified in the HDM and

• meet the GN restrictions.

3.13.16. The LFP generated (multiple) query formats that contain a HF Key and a
DI_KEY.

3.13.16.1. The DI_KEY will retrieve an exact match on a single
LFSN_TYPE.

3.13.16.2. Each HFSN_KEY or HFSN_VAR Key may participate in query
formats with several different DI_KEYs that were identified as variants
by the LFP.

3.13.16.3. Each query format will serve as a different query.

3.13.17. The HSE will further restrict the retrieval to records that match at least
one key of the GN.

3.13.17.1. If the query has produced only HFGN_KEYs, only records that
have at least one of the HFGN_KEYs will be retrieved.

3.13.17.2. If the query has produced mixed HFGN_KEYs and GN_INIT
Keys, the HSE will retrieve records that match at least one of the
HFGN_KEYs or the GN_INIT Keys.

3.13.17.3. If the query has produced only GN_INIT Keys, the HSE will
retrieve records that match at least one of the GN_INIT Keys.

Figure 28: Example: Record Matching Criteria: Mixed HFSN_KEYs/HFSN_VAR
Keys and LF DI_KEYs

QUERY #1         THORET            SLOPEZ           JOSSE   CARLOS   CRITERIA
HFSN_KEY
HFSN_VAR Key                       00976
DI_KEY           000652 (THORET)
                 000714 (TOREAT)
HFGN_KEY                                                    0007
GN_INIT Key(s)
HDM FORMATS:
1                THORET            LOPEZ (000976)                    YOB5, RL4, MFU, GN initial = J or H; or GN = 0007
2                LOPEZ             THORET                            YOB4, RL4, MFU, GN initial = J or H; or GN = 0007
3                THORET                                              YOB4, RL4, MFU, GN initial = J or H; or GN = 0007
4                LOPEZ                                               YOB2, RL1, MFU, GN initial = J or H; or GN = 0007
5                THORET            *                                 YOB2, RL1, FU, GN initial = J or H; or GN = 0007
6                LOPEZ             *                                 YOB0, RL0, MFU, GN initial = J or H; or GN = 0007
7                *                 THORET                            YOB0, RL0, MFU, GN initial = J or H; or GN = 0007
8                *                 LOPEZ                             YOB0, RL0, MFU, GN initial = J or H; or GN = 0007

1                TOREAT            LOPEZ                             YOB5, RL4, MFU, GN initial = J or H; or GN = 0007
2                LOPEZ             TOREAT                            YOB4, RL4, MFU, GN initial = J or H; or GN = 0007
3                TOREAT                                              YOB4, RL4, MFU, GN initial = J or H; or GN = 0007
4                LOPEZ                                               YOB2, RL1, MFU, GN initial = J or H; or GN = 0007
5                TOREAT            *                                 YOB2, RL1, FU, GN initial = J or H; or GN = 0007
6                LOPEZ             *                                 YOB0, RL0, MFU, GN initial = J or H; or GN = 0007
7                *                 TOREAT                            YOB0, RL0, MFU, GN initial = J or H; or GN = 0007
8                *                 LOPEZ                             YOB0, RL0, MFU, GN initial = J or H; or GN = 0007

3.13.18. The HSE will not retrieve database records that have already been
retrieved with another key.

3.13.19. The HSE will retrieve the database record ID; the Dual-SN Formats;
keys, their segment position and their related DI_VALs; Record Gender; and
TAQ tags.

3.13.20. Low Frequency Retrieval

3.13.21. The HSE will retrieve records from the database that contain one or both
of the query DI_KEYs in any SN position in the database record.

3.13.21.1. The HSE will retrieve records that contain the DI_KEYs within a
specified YOB Range for a Refusal Code Level.

3.13.21.2. The HSE will access the RLYOB Data Store to determine the
Refusal Code Level and associated Year-of-Birth Range that will apply.

3.13.21.3. The HSE will retrieve all records from the database with

• both DI_KEYs (or one DI_KEY and one SN_INIT Key) in either
position and RLYOB restriction;

• one of the DI_KEYs alone and RLYOB restriction; and

• one DI_KEY in either position, if the Year-of-Birth Range is YOB2
and the Refusal Code Level is RL1 (i.e., 00 or Type 1 Serious).

3.13.22. The HSE will retrieve the database record; the record ID; the Dual-SN
Formats; keys, their segment position and their related DI_VALs; Record
Gender; and TAQ tags.

3.13.23. The HSE will not retrieve a record that has already been retrieved using
other access methods (i.e., Mixed Frequency SN or HF names).

3.13.24. All records retrieved from the database will be sent to the Hispanic Filter
and Sorter.
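
Sections 3.13.18 and 3.13.23 both require that a record reach the Hispanic Filter and Sorter only once, however many keys or access methods would retrieve it. A minimal sketch of that rule follows; the function name, the (record ID, record) tuple layout and the ordering of retrieval passes are assumptions made for illustration.

```python
def retrieve_once(passes):
    """Merge retrieval passes, keeping each database record only once.

    passes is an iterable of hit lists, one per key or access method, in the
    order the passes are run; each hit is a (record_id, record) tuple.
    """
    seen = set()
    results = []
    for hits in passes:
        for record_id, record in hits:
            if record_id in seen:
                continue  # already retrieved with another key
            seen.add(record_id)
            results.append((record_id, record))
    return results


# Hypothetical hits: record 17 is found by both an HF pass and an LF pass.
hf_hits = [(17, "GARCIA THORET"), (23, "GARCIA LORENZ")]
lf_hits = [(17, "GARCIA THORET"), (31, "THORET FLORENZAN")]
merged = retrieve_once([hf_hits, lf_hits])
print([record_id for record_id, _ in merged])
```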

3.14. HISPANIC FILTER AND SORTER MODULE DECOMPOSITION

3.14.1. Identification

This module is known as the Hispanic Filter and Sorter (HFS).

3.14.2. Type

3.14.2.1. The HFS is a module that accepts database records retrieved by the 
HSE. 

3.14.2.2. The HFS compares each database record to the query record to 
determine if it qualifies for return to the user. 

3.14.2.3. The HFS is constituted of two subordinate functions: the Hispanic 
Filter and the Hispanic Sorter. 

3.14.2.4. The HFS must follow the Hispanic Search Engine (HSE). 

3.14.3. Purpose

3.14.3.1. The set of database records that the HSE will retrieve will be a set of
records delimited by quite narrow retrieval criteria. The database
records will have a digraph value associated with most SN segments and 
with many GN segments. However, the relative value of the database 
records to the query record will not be clear. The HFS will, therefore, 
evaluate each of the records retrieved for its proximity to a query record, 
will retain those that pass a pre-established threshold and will sort the 
resultant candidate list. 

3.14.3.2. The filtering process will take into account a number of factors that 
play a role in determining the relative value of Hispanic names. 

3.14.3.3. The filtering process will take into account factors that aid in the 
determination of the relative value of Hispanic records.

3.14.4. Function 

The HFS will first compare and qualify the query name and database record
name to determine a surname value (SN_VAL), will then evaluate and qualify
the query name and database record to determine a given name value
(GN_VAL) and will generate a composite score for the database records that
qualified on the basis of name evaluation by factoring in values for
Date-of-Birth, Refusal Level and Country of Birth.

The first comparison will be to identify an exact record match. All other
comparisons will be between the Dual-SN Formats of the query and database
record (for records with more than two surnames).

3.14.4.1. Filter Function of the HFS


3.14.4.2. General 

3.14.4.3. The Hispanic Filter and Sorter (HFS) will accept the candidate 
database records retrieved by the HSE. 

3.14.4.4. The HFS will first determine if the query record and database record 
match exactly. 

3.14.4.4.1. The HFS will compare the base format of the query and
database record; i.e., no derived format.

3.14.4.4.2. The name (both SN and GN), Date-of-Birth and
Country-of-Birth must match exactly.

3.14.4.4.3. If the query and database records match exactly, the HFS
will tag the record as an exact match and send the record
directly to the Sorter Function of the HFS.

3.14.4.5. The HFS will calculate name scores for each candidate database
record as it compares to the query record.

3.14.4.5.1. The HFS will use the derived formats as the basis of record
comparison.

3.14.4.5.2. A score for the SN, the SN_VAL, will be calculated.

3.14.4.5.3. A score for the GN, the GN_VAL, will be calculated.

3.14.4.5.3.1. The HFS will adjust the digraph value retrieved
with the database record by multiplying that value by
factors assigned to several parameters.

3.14.4.5.3.2. Factors (see Section 4.13) that contribute to the
determination and evaluation of the name score
(SN_VAL and GN_VAL) include

• SNTHR 

• GNTHR 

• ASVAL 

• AGVAL 

• OPSVAL 

• OPGVAL 

• INITSN 

• INITGN 

• TAQASN 

• TAQAGN 

• TAQXSN 

• TAQXGN 

• RGNDR 

3.14.4.5.3.3. To be included in the final candidate list, the score
of the SN and the score of the GN must each pass
predetermined SN and GN threshold levels (SNTHR and
GNTHR).

3.14.4.6. Surname Evaluation

3.14.4.7. A candidate record must pass a SN evaluation before it will be
submitted to a GN evaluation.

3.14.4.8. No record with Last Name Unknown (LNU) will be handled by
HNA-E.

3.14.4.9. The SN evaluation will be performed on the derived formats
(including the Dual-SN Formats) associated with the query and database
records.

3.14.4.10. High Frequency SN Keys (HFSN_KEYs or HFSN_VAR Keys)

3.14.4.10.1. The HFS will compare the keys of the query and database
and assign the DI_VAL retrieved with the database record to
the SN Comparands with matching keys.

3.14.4.10.1.1. Only one assignment of DI_VAL can be made for
a match.

3.14.4.10.1.2. If the query is GARCIA GOMEZ and the database
record is GARCIA GARCIA, the HFS will assign the
DI_VAL to one GARCIA match only.

3.14.4.10.2. If the SN Keys do not match, the HFS will perform a
digraph match of the segments with no assigned value
(LOPEZ and GOMEZ in Figure 29) and will assign the
digraph score to the DI_VAL.



Figure 29: Example: Database Records with HFSN_KEYs to be Evaluated by HFS

                   SN#1      HFSN_KEY   DI_VAL   SN#2      HFSN_KEY   DI_VAL
QUERY              GARCIA    0001                GOMEZ     0010

DATABASE RECORDS   GARCIA    0001       1.00     BOMEZ     0010       0.67
                   BARCIA    0001       0.71     GAMEZ     0010       0.67
                   LOPEZ     0004       0.17     GARCIA    0001       1.00



3.14.4.11. Low Frequency SN Keys (DI_KEYs)

3.14.4.11.1. The HFS will assign the DI_VAL associated with the
DI_KEY to matching database and query DI_KEYs.

3.14.4.11.1.1. Only one assignment of DI_VAL can be made for
a match.

3.14.4.11.1.2. If the query is THORET FLORENZAN and the
database record is THORET THORET, the HFS will
assign the DI_VAL to one THORET match only.

3.14.4.12. If the SN Keys do not match, the HFS will perform a digraph
match of the segments with no assigned value (LOPEZ and GARCIA in
Figure 30) and will assign the digraph score to the DI_VAL.

3.14.4.12.1. If there is more than one pair that does not have an
assigned digraph value, the HFS will perform a digraph
evaluation for each of the pairs. (See Section 3.14.4.16 for
details of the digraph assignment.)

3.14.4.12.2. Each value will be submitted to parameter evaluation.

3.14.4.12.3. After all parameters have been applied, the HFS will
choose the highest score for each pair. (See Section 3.14.4.17.)

Figure 30: Example: Database Records with LF SN (Mixed or all LF Keys) to be
Evaluated by HFS

                   SN#1      HFSN_KEY/   DI_VAL   SN#2      HFSN_KEY/   DI_VAL
                             DI_KEY                         DI_KEY
QUERY              GARCIA    0001                 THORET    000652

DATABASE RECORDS   GARCIA    0001        1.00     THORET    000652      1.00
                   THORET    000652      1.00     BARCIA    0001        0.71
                   LOPEZ     0004        0.00     THORET    000652      1.00



3.14.4.13. The HFS will adjust the DI_VAL of each segment according to 
parameter values in the Hispanic Parameter Data Store (see Section 4.13 
for details). 

3.14.4.13.1. The HFS will determine if the appropriate parameter 
conditions obtain. 

3.14.4.13.2. If the appropriate conditions are present, the DI_VAL will
be multiplied by the value assigned to the parameter and the
DI_VAL will be lowered.

3.14.4.13.3. Parameter Conditions 

3.14.4.13.4. INITSN: Initial 

3.14.4.13.4.1. Definition 1: The SN segment is a single 
character in both comparands and the character matches 
exactly. 

3.14.4.13.4.2. Action: The HFS will make no change. 

3.14.4.13.4.3. Definition 2: A SN segment is a single character 
and its SN_INIT Key matches the SN_INIT Key of the
other comparand.

3.14.4.13.4.4. Action: Assign the INITSN value to the
comparison value (i.e., do not calculate the DI_VAL).
The initial may be subjected to any following actions
(e.g., out-of-place segment).

3.14.4.13.4.5. Definition 3: A SN segment is a single character
and the SN_INIT Keys of the comparands do not match.

3.14.4.13.4.6. Action: Assign the INITNM value to the
comparison value (i.e., do not calculate the DI_VAL).
The initial may be subjected to any following actions
(e.g., out-of-place segment).

3.14.4.13.5. OPSVAL: Out-of-Place Surname Value

3.14.4.13.5.1. Definition: A SN segment that is not in the same
relative position in the SN string in both the database
and query records.

3.14.4.13.5.2. Action: Multiply the DI_VAL by the OPSVAL.

3.14.4.13.6. ASVAL: Anchor Surname Value

3.14.4.13.6.1. Definition: For database records that contain two
SN segments, the database SN segments are in the
correct position relative to the query SN segments.

3.14.4.13.6.2. Action: Multiply the DI_VAL of the second
(rightmost) segment by the ASVAL.



Figure 31: Example 1: SN Parameter Evaluation: OPSN Applies

            GARCIA                GOMEZ
BOMEZ                             0.67 * 0.65 = 0.44
GARCIA      1.00 * 0.65 = 0.65

Figure 32: Example 2: SN Parameter Evaluation: OPSN Applies

            GARCIA                GOMEZ
GAMEZ                             0.67 * 0.65 = 0.44

Figure 33: Example 3: SN Parameter Evaluation: ASVAL Applies

            GARCIA                GOMEZ
GARZA       0.62
GOMEZ                             1.00 * 0.65 = 0.65
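
The parameter evaluation in Figures 31 through 33 is a chain of multiplications applied to a segment's DI_VAL. The sketch below reproduces that arithmetic; the function name is hypothetical, and the factor value 0.65 for both OPSVAL and ASVAL is simply the value shown in the figures (the real values live in the Hispanic Parameter Data Store).

```python
OPSVAL = 0.65  # out-of-place surname factor as shown in Figures 31 and 32
ASVAL = 0.65   # anchor surname factor as shown in Figure 33


def adjust_di_val(di_val, factors):
    """Multiply a DI_VAL by each parameter factor whose condition applies."""
    for factor in factors:
        di_val *= factor
    return di_val


# Figure 31: BOMEZ (0.67) is out of place relative to the query GOMEZ.
print(round(adjust_di_val(0.67, [OPSVAL]), 2))
# Figure 33: GOMEZ (1.00) is the second segment of a two-segment record.
print(round(adjust_di_val(1.00, [ASVAL]), 2))
```

A segment to which no parameter condition applies passes through unchanged, as GARZA (0.62) does in Figure 33.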



3.14.4.13.7. TAQ Filter 

3.14.4.13.8. All TAQ tags (ID_NO, disposition, TAQ_TYPE and
associated SN stem) will be retrieved with the database record.

3.14.4.13.9. The HFS will evaluate any TAQs associated with the SN
segments being evaluated, except Stranded Prefixes (see
Section 3.5.4.2.7.3).

3.14.4.13.9.1. A Stranded Prefix will not play a role in the
record comparison.

3.14.4.13.10. Single TAQs

3.14.4.13.11. Missing TAQs

3.14.4.13.12. TAQASN: Absent TAQ Value

3.14.4.13.12.1. Definition 1: One of the two comparands
(query/database SN segment) has a TAQ tag, the other
does not.

3.14.4.13.12.2. Definition 2: Both comparands (query/database
SN segments) have a single TAQ tag, one is a TAQ
DELETE, the other a TAQ DISREGARD.

3.14.4.13.12.3. Action: Multiply the DI_VAL by the TAQASN
value.



Figure 34: Example: TAQ DISREGARD (DE) and No TAQ

            DE VARGAS
VARGAS      1.00 * 0.90 = 0.90

Figure 35: Example: TAQ DISREGARD (DE) and TAQ DELETE (DR)

            DE VARGAS
DR VARGAS   1.00 * 0.90 = 0.90



3.14.4.13.13. TAQ DELETE 

3.14.4.13.14. TAQXSN: Deleted TAQ Value 

3.14.4.13.14.1. Definition: Both SN comparands have a single 
TAQ DELETE tag. 

3.14.4.13.14.2. Action: 

3.14.4.13.14.3. If the TAQ DELETE tags refer to the same TAQ 
segment, the DI_VAL will be unchanged. 

3.14.4.13.14.4. If the TAQ DELETE tags refer to different TAQ
DELETE segments, multiply the DI_VAL by the
TAQXSN value.



Figure 36: Example: Same TAQ DELETE (DR)

            DR VARGAS
DR VARGAS   1.00

Figure 37: Example: Different TAQ DELETES (DR and SR)

            SR VARGAS
DR VARGAS   1.00 * 0.85 = 0.85



3.14.4.13.15. TAQ DISREGARD 

3.14.4.13.15.1. Definition: The HFS will access the TAQ Filter
Data Store (TF) to process records that both contain SN
TAQ segments that have been tagged as DISREGARD.

3.14.4.13.15.2. Action 1: The HFS will assign TAQDIS#1 to
the TAQ DISREGARD segment for the database SN
segment and TAQDIS#2 to the TAQ DISREGARD
segment for the query SN segment.

3.14.4.13.15.3. Action 2: If the two TAQ DISREGARD
segments match, the DI_VAL will remain unchanged.

3.14.4.13.15.4. Action 3: If the two TAQ DISREGARD
segments do not match, the HFS will identify the
TF_VALUE for the pair in the TF.

3.14.4.13.15.4.1. The HFS will multiply the DI_VAL by
the TF_VALUE for the pair.

Figure 38: Example: Different TAQ DISREGARDS (DE and LA)

            DE PENA
LA PENA     1.00 * 0.75 = 0.75
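
The single-TAQ rules (TAQASN, TAQXSN and TAQ DISREGARD) can be read as a small decision function that returns the factor by which the DI_VAL is multiplied. The sketch below is an illustration under assumptions: the function name, the (disposition, segment) tuple layout and the in-memory `TF` dictionary are hypothetical, and the factor values 0.90, 0.85 and 0.75 are the ones shown in Figures 34 through 38.

```python
TAQASN = 0.90  # absent-TAQ factor as shown in Figures 34 and 35
TAQXSN = 0.85  # deleted-TAQ factor as shown in Figure 37
# Stand-in for the TAQ Filter Data Store (TF): one TF_VALUE per unordered pair.
TF = {frozenset({"DE", "LA"}): 0.75}


def single_taq_factor(a, b):
    """Return the DI_VAL multiplier for two comparands with at most one TAQ.

    Each argument is None (no TAQ) or a (disposition, segment) tuple whose
    disposition is "DELETE" or "DISREGARD".
    """
    if a is None and b is None:
        return 1.0
    if a is None or b is None:
        return TAQASN                      # TAQASN, Definition 1
    (disp_a, seg_a), (disp_b, seg_b) = a, b
    if disp_a != disp_b:
        return TAQASN                      # TAQASN, Definition 2
    if seg_a == seg_b:
        return 1.0                         # same segment: DI_VAL unchanged
    if disp_a == "DELETE":
        return TAQXSN                      # different TAQ DELETE segments
    return TF[frozenset({seg_a, seg_b})]   # different DISREGARD segments


print(single_taq_factor(("DISREGARD", "DE"), None))                 # Figure 34
print(single_taq_factor(("DELETE", "SR"), ("DELETE", "DR")))        # Figure 37
print(single_taq_factor(("DISREGARD", "DE"), ("DISREGARD", "LA")))  # Figure 38
```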



3.14.4.13.16. Multipart TAQs

3.14.4.13.16.1. Definition: If at least one SN comparand has
multipart TAQ tags (they may be all DISREGARD, all
DELETE, or mixed DISREGARD/DELETE), the HFS
will perform the following analyses.

3.14.4.13.16.2. Action: If all TAQs match, the HFS will make no
change in the DI_VAL.

3.14.4.13.16.3. TAQ DELETES

3.14.4.13.16.3.1. Definition: All DELETE tags

3.14.4.13.16.3.2. Action 1: If any DELETE TAQ
matches, the HFS applies no change.

3.14.4.13.16.3.3. Action 2: If no DELETE TAQs
match, multiply the DI_VAL by the TAQXSN
Value.

Figure 39: Example: Multiple TAQ DELETES with Some Match

             REV DR VARGAS
REV VARGAS   1.00

Figure 40: Example: Multiple TAQ DELETES with No Match

                GENERAL DR VARGAS
REV SR VARGAS   1.00 * 0.85 = 0.85



3.14.4.13.16.4. TAQ DISREGARDS 

3.14.4.13.16.4.1. Definition: All DISREGARD tags 

3.14.4.13.16.4.2. Action 1: If any TAQ DISREGARD
segment matches, the HFS will make no change
in the DI_VAL.

3.14.4.13.16.4.3. Action 2: If no TAQ DISREGARD
segments match, the HFS will identify the
highest match value from the TF (TF_VALUE)
and multiply that by the DI_VAL.

Figure 41: Example: Multiple TAQ DISREGARDS with Matching TAQ Segment (DE
LAS/DE LOS)

               DE LAS LUNAS
DE LOS LUNAS   1.00

Figure 42: Example: Multiple TAQ DISREGARDS with No Matching TAQ Segment
(DE SANTA/LA)

           DE SANTA MARIA
LA MARIA   1.00 * 0.75 = 0.75



3.14.4.13.16.5. TAQ DISREGARD and DELETES 

3.14.4.13.16.5.1. Definition: Mixed 
DISREGARD/DELETE tags 

3.14.4.13.16.5.2. Action 1: If DISREGARD segments
are present in both comparands and there is any
match among the DISREGARD segments, the
HFS will make no change in the DI_VAL.

3.14.4.13.16.5.3. Action 2: If DISREGARD segments
are present in both comparands and there is no
match among the DISREGARD segments, the
HFS will determine the highest match value from
the TF for any DISREGARD tags and multiply
the DI_VAL by that value. (That is, ignore any
DELETE tags.)

3.14.4.13.16.5.4. Action 3: If a DISREGARD segment
is in one comparand and not the other and the
two comparands have at least one DELETE tag
that matches, the HFS will make no change in the
DI_VAL.

3.14.4.13.16.5.5. Action 4: If a DISREGARD segment
is in one comparand and not the other and the
two comparands have DELETE tags that do not
match, multiply the DI_VAL by the TAQXSN.



Figure 43: Example: Multiple TAQs, DISREGARDS (DE/LOS) 

              | SR DE VARGAS 
DR LOS VARGAS | 1.00 * 0.75 = 0.75 



Figure 44: Example: Multiple TAQs, DELETES (DRA/DR) 

          | DRA DE VARGAS 
DR VARGAS | 1.00 * 0.85 = 0.85 
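
The multipart TAQ rules of Section 3.14.4.13.16 can be sketched as a single function. This is a minimal illustration and not the HFS implementation: the function name, the (segment, tag) tuple representation, and the dictionary standing in for the TF lookup are all assumptions. The rules quoted above do not cover the case in which only one comparand carries DISREGARD segments and neither carries DELETE tags, so the sketch leaves the DI_VAL unchanged there.

```python
from itertools import product

DELETE, DISREGARD = "DELETE", "DISREGARD"


def adjust_for_multipart_taqs(di_val, taqs_a, taqs_b, tf_values, taqxsn):
    """Apply the multipart TAQ rules to a digraph value.

    taqs_a / taqs_b: lists of (segment, tag) tuples for each comparand
                     (hypothetical representation).
    tf_values:       dict mapping frozenset({seg1, seg2}) to a TF value
                     (hypothetical stand-in for the TF lookup).
    taqxsn:          the TAQXSN parameter value.
    """
    # All TAQs match: no change.
    if sorted(taqs_a) == sorted(taqs_b):
        return di_val

    dis_a = {seg for seg, tag in taqs_a if tag == DISREGARD}
    dis_b = {seg for seg, tag in taqs_b if tag == DISREGARD}
    del_a = {seg for seg, tag in taqs_a if tag == DELETE}
    del_b = {seg for seg, tag in taqs_b if tag == DELETE}

    # DISREGARD segments in both comparands: DELETE tags are ignored.
    if dis_a and dis_b:
        if dis_a & dis_b:
            return di_val
        best = max(tf_values[frozenset(pair)] for pair in product(dis_a, dis_b))
        return di_val * best

    # DISREGARD in at most one comparand: decide on the DELETE tags.
    if del_a or del_b:
        return di_val if del_a & del_b else di_val * taqxsn

    # Not covered by the quoted rules (assumption: leave unchanged).
    return di_val
```

With a TAQXSN of 0.85 and a TF value of 0.75 for the DE/LOS pair, the function reproduces the values shown in Figures 39 through 44.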



3.14.4.14. After all parameters have been applied, the HFS will calculate the 
SN_VAL. 

3.14.4.14.1. The HFS will choose the highest value for the row and 
column for any SN segments that have more than one digraph 
value assigned to them. 

3.14.4.14.2. The HFS will sum the DI_VALs of all SN segments and 
will divide by the number of DI_VALs. 



Figure 45: Example 1: Filter Evaluation 

       | GARCIA             | GOMEZ 
BOMEZ  |                    | 0.67 * 0.65 = 0.44 
GARCIA | 1.00 * 0.65 = 0.65 | 




Figure 46: Example 2: Filter Evaluation 

      | GARCIA | GOMEZ 
GAMEZ |        | 0.67 * 0.65 = 0.44 


Figure 47: Example 3: Filter Evaluation 

      | GARCIA | GOMEZ 
GARZA | 0.62   | 
GOMEZ |        | 1.00 * 0.65 = 0.65 



3.14.4.14.3. In Figure 45, (0.44 + 0.65) / 2 = 0.55 

3.14.4.14.4. In Figure 46, 0.44 / 1 = 0.44 

3.14.4.14.5. In Figure 47, (0.62 + 0.65) / 2 = 0.64 
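
The averaging step and the threshold comparison that follows it can be sketched as below. This is an illustration under stated assumptions: the function names are hypothetical, scores are passed as decimal strings, and half-up rounding to two places is assumed because it agrees with the worked figures; the specification does not name a rounding rule.

```python
from decimal import Decimal, ROUND_HALF_UP


def sn_val(di_vals):
    """Average the selected DI_VALs of all SN segments, to two places.

    di_vals: iterable of decimal strings such as "0.44" (assumed input form).
    Half-up rounding is an assumption, not stated in the specification.
    """
    values = [Decimal(v) for v in di_vals]
    mean = sum(values, Decimal(0)) / len(values)
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def passes_threshold(value, threshold):
    """A value passes when it is equal to or greater than the threshold."""
    return value >= Decimal(threshold)
```

Decimal arithmetic is used here only so that values ending in 5 in the third place round the same way as the hand calculations.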
3.14.4.15. The HFS will compare the SN_VAL to the SNTHR. 

3.14.4.15.1. The SN_VAL must be equal to or greater than the 
SNTHR. 

3.14.4.15.2. If the SNTHR were 0.60, only Example 3 above would 
pass. 

3.14.4.15.3. The record must pass the SNTHR to qualify for Given 
Name Evaluation. 

3.14.4.16. Given Name Evaluation 

3.14.4.16.1. The HFS will evaluate the GN in a similar way to the SN 
evaluation. 

3.14.4.16.2. The HFS will assign a DI_VAL of 1.00 to any match with 
FNU. 

3.14.4.16.3. The GN format will permit more than two GN segments 
to be evaluated. 

3.14.4.16.3.1. If the segment pair has a matching HFGN_KEY, 
the digraph value (DI_VAL) retrieved with that key will 
be assigned to the pair being evaluated. 

3.14.4.16.3.2. For any segment pair that does not have a 
HFGN_KEY and associated DI_VAL, the DI_VAL will 
be calculated. (See Section 3.11.4.19.4 for digraph 
evaluation.) 

3.14.4.16.3.3. The HFS will not calculate a digraph value for a 
GN_INIT Key value or GN initial. 

3.14.4.16.3.3.1. The HFS will calculate the digraph 
relationship for all segments that have not been 
assigned a DI_VAL. 

3.14.4.16.3.3.2. The HFS will not compare names that 
have a DI_VAL assigned. 



Figure 48: Example: GN Digraph Evaluation 

          | MARIA           | LORNA | SILVIA | CATERINA 
CATHERINA |                 |       |        | 0.74 (HFGN_KEY) 
MARIA     | 1.00 (HFGN_KEY) |       |        | 
LARA      |                 | 0.36  | 0.08   | 
MILDRED   |                 | 0.00  | 0.07   | 

3.14.4.16.4. The DI_VAL of each GN segment will be adjusted by 
several GN parameters. 

3.14.4.16.5. INITGN: Given Name Initial 

3.14.4.16.5.1. Definition 1: The GN segment is a single 
character in both comparands and the character matches 
exactly. 

3.14.4.16.5.2. Action: The HFS will make no change. 

3.14.4.16.5.3. Definition 2: A GN segment is a single character 
and its GN_INIT Key matches the GN_INIT Key of the 
other comparand. 

3.14.4.16.5.4. Action: Assign the INITGN value to the 
comparison value (i.e., do not calculate the DI_VAL). 
The initial may be subjected to any following actions 
(e.g., out-of-place segment). 

3.14.4.16.5.5. Definition 3: A GN segment is a single character 
and the GN_INIT Keys of the comparands do not match. 

3.14.4.16.5.6. Action: Assign the INITNM value to the 
comparison value (i.e., do not calculate the DI_VAL). 
The initial may be subjected to any following actions 
(e.g., out-of-place segment). 

3.14.4.16.6. OPGVAL: Out-of-Place Given Name Value 

3.14.4.16.6.1. Definition: A GN segment that is not in the same 
relative position in the GN string in both the database 
and query records. 

3.14.4.16.6.2. Action: Multiply the DI_VAL by the OPGVAL. 

3.14.4.16.7. AGVAL: Anchor Given Name Value 

3.14.4.16.7.1. Definition: For database records that contain two 
or more GN segments, the database SN segments are in 
the correct position relative to the query SN segments. 

3.14.4.16.7.2. Action: Multiply the DI_VAL of the GN 
segments to the right of the first (leftmost segment) by 
the AGVAL. 



Figure 49: Example 1: GN Parameter Evaluation: OPGN Applies 

          | MARIA              | CATHERINA 
KATHERINA |                    | 0.90 * 0.65 = 0.59 
MARIA     | 1.00 * 0.65 = 0.65 | 





HNA-E 

Language Analysis Systems. Inc. 



62 



03/19/98 



Figure 50: Example 2: GN Parameter Evaluation: OPGN Applies 

      | JOSE | BARTOLOMEO 
BARTO |      | 0.71 * 0.65 = 0.46 



Figure 51: Example 3: GN Parameter Evaluation: AGVAL Applies 

      | JUAN | MARIO 
JUANA | 0.73 | 
MARIA |      | 0.83 * 0.65 = 0.54 



Figure 52: Example 4: Given Name Parameter Evaluation 

          | MARIA                       | LARA                        | MILDRED                     | CATERINA 
CATHERINA |                             |                             |                             | 0.74 * 0.65 = 0.48 (OPGVAL) 
MARIA     | 1.00 * 0.65 = 0.65 (OPGVAL) |                             |                             | 
LORNA     |                             | 0.36 * 0.65 = 0.23 (OPGVAL) | 0.08 * 0.65 = 0.05 (OPGVAL) | 
SILVIA    |                             | 0.00 * 0.65 = 0.00 (OPGVAL) | 0.07 * 0.65 = 0.05 (OPGVAL) | 





3.14.4.16.8. TAQ Evaluation will proceed as with the SN, mutatis 
mutandis (See Section 3.14.4.13.7). 

3.14.4.17. After all GN evaluations have been performed, the HFS will 
choose the highest score for each GN segment that has multiple 
DI_VALs (i.e., those for which no DI_VAL was retrieved with the key). 

3.14.4.17.1. The highest score for both the row and column must be 
chosen. 

3.14.4.17.2. Only one score per row and column is permitted. 

3.14.4.17.3. If two scores are equal, only one is chosen. 

3.14.4.17.4. In the example above, the higher score for LORNA is on 
the match with LARA (0.23); for SILVIA, MILDRED (0.05). 

3.14.4.17.4.1. Note that MILDRED scores are equal, but the row 
for LORNA has already been chosen. 

3.14.4.17.4.2. Only one value can be chosen for each row and 
column. 

3.14.4.18. The HFS will sum all DI_VALs from the comparison matrix and 
will divide by the number of DI_VALs to produce the GN score. 

3.14.4.18.1. In Example 1, (0.59 + 0.65) / 2 = 0.62 

3.14.4.18.2. In Example 2, 0.46 / 1 = 0.46 

3.14.4.18.3. In Example 3, (0.73 + 0.54) / 2 = 0.64 

3.14.4.18.4. In Example 4, (0.48 + 0.65 + 0.23 + 0.05) / 4 = 0.35 
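
The selection rule of Section 3.14.4.17 (one score per row and per column, highest first) followed by the averaging of Section 3.14.4.18 can be sketched as follows. The specification does not name an algorithm; the greedy pass below is an assumption that happens to reproduce the choices described for Example 4, and the function names and the dictionary layout of the comparison matrix are hypothetical.

```python
def choose_scores(matrix):
    """Pick at most one score per row and per column, highest scores first.

    matrix: dict mapping (row_name, column_name) to an adjusted DI_VAL
            (hypothetical layout of the comparison matrix).
    Ties are resolved by keeping whichever equal score is met first.
    """
    used_rows, used_cols, chosen = set(), set(), {}
    ranked = sorted(matrix.items(), key=lambda item: item[1], reverse=True)
    for (row, col), score in ranked:
        if row in used_rows or col in used_cols:
            continue  # this row or column already has its score
        used_rows.add(row)
        used_cols.add(col)
        chosen[(row, col)] = score
    return chosen


def gn_score(matrix):
    """Sum the chosen DI_VALs and divide by how many were chosen."""
    chosen = choose_scores(matrix)
    return sum(chosen.values()) / len(chosen)
```

Applied to the Example 4 matrix, the pass keeps the LORNA/LARA score before it reaches the two equal MILDRED scores, so SILVIA receives the MILDRED column, as the text describes.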
3.14.4.19. The HFS will further evaluate the Given Name by comparing the 
record gender of the two comparands. 

3.14.4.19.1. If the record genders match, no action will take place. 

3.14.4.19.2. If the record genders do not match, the HFS will apply the 
RGNDR value to the GN score. 

3.14.4.19.2.1. The HFS will access the TF to determine the 
RGNDR Value. 

3.14.4.19.2.2. The HFS will multiply the GN score by the 
RGNDR value. 

3.14.4.20. The value resulting from the full GN evaluation will be the 
GN_VAL. 

3.14.4.21. The HFS will compare the GN_VAL to the GNTHR. 

3.14.4.21.1. The GN_VAL must be equal to or greater than the 
GNTHR. 

3.14.4.21.2. The GN_VAL must pass the GNTHR for the record to 
qualify for calculation of the Composite Score. 

3.14.4.22. Composite Score 

3.14.4.23. The HFS will develop a composite score for two comparands that 
will reflect the proximity of the query and database records. 

3.14.4.24. The Composite Score will be used to rank order the records being 
evaluated. 

3.14.4.25. The Name is one component of the Composite Score; others are 
the Refusal Level, Date-of-Birth and Country of Birth. 

3.14.4.25.1. The HFS will adjust the GN_VAL and the SN_VAL by 
factors that reflect the proximity of the Refusal Level, Date-of- 
Birth and Country of Birth. 

3.14.4.25.2. The GN_VAL and SN_VAL will be multiplied by RL, 
YOB and COB factors. 
3.14.4.26. Refusal Level Factor 

3.14.4.27. The HFS will access the Refusal Code Level Data Store to 
determine the Refusal Level Category of the Refusal Code. 

3.14.4.28. The HFS will access the Hispanic Parameter Data Store to find the 
PARM_VAL associated with the Refusal Level (RL#). 

3.14.4.29. Date-of-Birth Factor 

3.14.4.30. The HFS will access the Year-of-Birth Range Data Store to 
determine the YOB Category, YOB#, of the Dates-of-Birth of the 
comparands. The highest value is applied to the relationship. 

3.14.4.31. The HFS will access the Hispanic Parameter Data Store to find the 
PARM_VAL associated with the YOB Category (YOB#). 

3.14.4.32. Country-of-Birth Factor 

3.14.4.33. The HFS will access the Hispanic Country-of-Birth Category Data 
Store (HCOB) to determine the COB Category, COB#. 

3.14.4.33.1. The HFS will identify the COB#. 

3.14.4.33.2. The HFS will access the Hispanic Parameter Data Store 
to find the PARM_VAL associated with the Country-of-Birth 
Category (COB#). 

3.14.4.34. Calculating the Composite Score 

3.14.4.35. The HFS will calculate a Composite Score by multiplying the 
SN_VAL by the GN_VAL by the RL# PARM_VAL by the YOB# 
PARM_VAL by the COB# PARM_VAL. 
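
The Composite Score calculation, together with the two threshold gates that precede it, can be sketched as below. The function name and argument order are hypothetical, and returning None for a record that fails a threshold is an assumption about how a non-qualifying record might be signalled.

```python
from math import prod
from typing import Optional


def composite_score(sn_val: float, gn_val: float,
                    rl_parm: float, yob_parm: float, cob_parm: float,
                    snthr: float, gnthr: float) -> Optional[float]:
    """Multiply the name values by the RL#, YOB# and COB# PARM_VALs.

    A record qualifies only if SN_VAL >= SNTHR and GN_VAL >= GNTHR;
    otherwise None is returned (an assumed convention).
    """
    if sn_val < snthr or gn_val < gnthr:
        return None
    return prod((sn_val, gn_val, rl_parm, yob_parm, cob_parm))
```

Because every factor is a multiplier, a low PARM_VAL for any one of the three data elements lowers the whole score in proportion.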

3.14.4.36. Final Sort Function of the HFS 

3.14.4.37. The HFS will order the final candidate list. 

3.14.4.38. The HFS will place at the top of the candidate list all records that 
have been tagged as exact matches. 

3.14.4.39. The HFS will then rank order in descending order of Composite 
Score all records for which a Composite Score has been calculated. 

3.14.4.39.1. The goal of the final sort is to place exact record matches 
on the top and to rank order the remaining records by the 
degree of contribution that each data element (SN, GN, DOB, 
COB, Refusal Code Level (RL)) makes to the overall record 
value. 

3.14.4.39.2. Further details of the sort will be derived from extensive 
discussion about the business requirements. 

3.14.4.39.3. Because the scores from the various pipes may not have 
been calculated in the same way, a method for evaluating the 
relative value of candidate records will have to be devised. 

3.14.4.40. Internal Sort Order 

3.14.4.40.1. There may be cases in which the sorting criteria are met 
equally by more than 1 record. 

3.14.4.40.2. Where multiple records qualify equally, there will be an 
internal sort order. 

3.14.4.40.2.1. SN Score 

3.14.4.40.2.2. GN Score 

3.14.4.40.2.3. DOB Level 

3.14.4.40.2.4. Refusal Code Level 

3.14.4.40.2.5. COB Relationship 

3.14.4.41. The HFS will return the top n records to the central CLASS-E 
sorter. 

3.14.4.41.1. The number of records to be returned will be a system 
setting. 
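
The final sort can be sketched as a single keyed sort. The Candidate class and its field names are hypothetical, and treating a higher value as better for every tie-break element is an assumption; the specification lists only the order of the elements, not their direction.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    uid: str
    exact: bool        # tagged as an exact match
    composite: float   # Composite Score
    sn_val: float      # SN score
    gn_val: float      # GN score
    dob_level: int     # DOB level (higher assumed better)
    rl_level: int      # Refusal Code Level (higher assumed better)
    cob_level: int     # COB relationship (higher assumed better)


def final_sort(candidates: List[Candidate], n: int) -> List[Candidate]:
    """Exact matches first, then descending Composite Score, with the
    internal sort order breaking ties; return the top n records."""
    def key(c: Candidate):
        return (c.exact, c.composite, c.sn_val, c.gn_val,
                c.dob_level, c.rl_level, c.cob_level)
    return sorted(candidates, key=key, reverse=True)[:n]
```

Placing the exact-match flag first in the key keeps tagged records on top regardless of their scores, and the value of n corresponds to the system setting mentioned above.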

3.15. LINGUISTIC TRACE FACILITY MODULE DECOMPOSITION 

3.15.1. Identification 

This module is known as the Linguistic Trace Facility (LTF). 

3.15.2. Type 

The LTF is a program that will interact with any or all modules and functions 
within those modules. 

3.15.3. Purpose 

The LTF will allow system evaluators to access information about the system 
functions so that the quality of the content can be ensured. To diagnose and 
remedy problems associated with questionable system results, evaluators 
must have access to the results of system functionality at various points 
during the processing cycle. 

3.15.4. Function 

3.15.4.1. The LTF will be a mechanism that will copy and divert statistics, 
information, and processing results to a file outside the main processing 
module. 

3.15.4.2. The file will be readily accessible for on-line examination by system 
evaluators. 

3.15.4.3. Multiple trace points will be identified when the system is built. 

3.15.4.4. Examples of trace points: 

• Derived record formats 

• All keys generated for a query and for an add 

• Records qualifying with the LFDLKEY 

• SN and GN DI_VAL 

• SN_VAL and GN_VAL 

• Record Gender 

• RL#, YOB#, COB# Values 

• Sort considerations. 

4. DATA DECOMPOSITION 
4.1. DATA 

4.1.1. The input data for an HNA-E query will contain all information that is 
currently required by CLASS. 

4.1.2. The input data for an HNA-E query will be in the standard format 
currently required by CLASS. 

• NAME (Surname, Given Name); 

• The SN is a required name field and therefore must be filled. 

• Last Name Unknown (LNU) is not a permitted string in HNA-E. 

• The SN may be represented by a single character, which will be 
interpreted as an initial. 

• DOB (Date of Birth; Day Month Year); and 

• COB (Country of Birth; FIPS codes). 

4.1.3. In addition, the following will be specified: 

• Applicant Gender (AG): Male (M), Female (F), Unknown/Ambiguous 
(U). 

• A unique identifier (UID) (as defined in CLASS-E). 

4.1.4. For record adds, additional record information will be entered, as required 
by CLASS and CLASS-E: e.g., refusal code, province of birth. 
4.2. DATA COLLECTION 

4.2.1. Two alternative approaches to tagging the name data are available: the 
name as an object and the name as a data element. 

4.2.2. The system could define the name as an object that knows something 
about itself and collects information as it passes through the various 
processing modules. 

4.2.2.1. A name object would make the relevant information available to the 
various processing modules, as needed, from one consistent, predefined 
object. 

4.2.2.2. A name object may also permit the same name to be handled in the 
same way on another occasion. Reuse of information would be 
especially valuable for HF names. 

4.2.3. The second, alternative approach is to tag the specific items as they 
undergo processing or change, to access information in data stores as it is 
needed, and to tag the name or name segment for the relevant processes it 
undergoes. 

4.3. DATA STORES 

4.3.1. Several of the Data Stores proposed could be collapsed into one data store 
(e.g., the HF SN Data Stores: HFST and HFSV); for ease and clarity of 
exposition and reference, the data stores have been maintained separately. 

4.3.2. HNA-E will access X Data Stores: 

• Hispanic TAQ Data Store (HTD) 

• High Frequency Surname Type Data Store (HFST) 

• High Frequency Surname Variant Data Store (HFSV) 

• Low Frequency Surname Type Data Store (LFST) 

• Hispanic Given Name Type Data Store (HGT) 

• High Frequency Given Name Variant Data Store (HFGV) 

• Hispanic Character Data Store (HCD) 

• Hispanic Parameter Data Store (HPD) 

• High Frequency Decision Matrix (HDM) 

• Refusal Code Level Data Store (RCL) 

• Year-of-Birth Range Data Store (YR) 

• Refusal Code Level/Year-of-Birth Range Data Store (RLYOB) 

• Country-of-Birth Proximity Data Store (COBPROX) 

• Hispanic Country-of-Birth Category Data Store (HCOB) 

4.4. HISPANIC TITLE/AFFIX/QUALIFIER DATA STORE DECOMPOSITION 
Because the HNA-E design is viewed as an independent sub-program of the 
CLASS-E system, the Hispanic Title/Affix/Qualifier Data Store is presented here as 
a separate table. It is strongly suggested, however, that CLASS-E support one TAQ 
Data Store in which the cultural affinity of each TAQ segment is indicated. This 
will reduce table maintenance and will provide a global picture of the handling of 
TAQs. 

4.4.1. Identification 

This data store is known as the Hispanic Title/Affix/Qualifier (TAQ) Data 
Store (HTD). 

4.4.2. Type 

4.4.2.1. The HTD is a data store that contains the Hispanic-specific Title, 
Affix and Qualifier segments with additional information about the 
disposition of the items. 

4.4.2.2. The HTD will be accessed by the Hispanic Name Preprocessor 
(HNP) and the Hispanic Filter and Sorter (HFS). 

4.4.2.3. The format of the HTD will be 



Figure 53: Format: Hispanic TAQ Data Store (HTD) 

DATA FIELD | DATA TYPE | FIELD SIZE | DATA RANGE/VALUE 
ID_NO      | integer   | 4          | 0...9999 
TAQ_FORM   | character | 15         | alphabetics 
TAQ_TYPE   | character | 1          | T, I, P, S, Q 
DELETE     | integer   | 1          | 1, 0 (True, False) 
DISREGARD  | integer   | 1          | 1, 0 (True, False) 
REMOVE     | integer   | 1          | 1, 0 (True, False) 



4.4.2.4. Definitions 

4.4.2.4.1. ID_NO: a unique, arbitrary number that identifies the TAQ 
segment. 

4.4.2.4.2. TAQ_FORM: the string that represents the TAQ; the 
TAQ_FORM may be a multipart string (i.e., a string that 
includes internal white space). 

4.4.2.4.3. TAQ_TYPE: an indicator of the kind of TAQ segment 
present: a title (T), prefix (P), infix (I), suffix (S), or qualifier 
(Q). 

4.4.2.4.4. DELETE: 

4.4.2.4.4.1. The segment is to be removed from all further 
consideration in the name search process. 

4.4.2.4.4.2. The segment is referenced in the filtering 
process. 

4.4.2.4.4.3. The segment is not removed from the original 
record and is returned with the record to the user. 






4.4.2.4.4.4. True (1) or False (0) indicates whether or not this 
function is to apply to the segment(s) under 
consideration. 

4.4.2.4.5. DISREGARD: 

4.4.2.4.5.1. The segment is to be removed from further 
consideration in the name search process but will 
undergo special evaluation in the filtering process. It 
will be returned with the record to the user. 

4.4.2.4.5.2. True (1) or False (0) indicates whether or not this 
function is to apply to the segment(s) under 
consideration. 

4.4.2.4.6. REMOVE: 

4.4.2.4.6.1. The segment occurs attached to the name stem. 

4.4.2.4.6.2. The conjoined TAQ will be separated from a base 
name segment. (See Section 3.5.4.4.3). 

4.4.2.4.6.3. True (1) or False (0) indicates whether or not this 
function is to apply to the segment(s) under 
consideration. 

4.4.2.4.6.4. The separated segment will also be marked for 
DELETE/DISREGARD treatment. 

4.4.3. Purpose 

Peripheral elements (Titles, Affixes, and Qualifiers) in names do not 
contribute as much to the name evaluation as does the name stem. Identifying 
and removing these elements in the name processing component is important. 
They do, however, contribute to the overall value of a name when determining 
the proximity of one name to another. They will therefore contribute some 
value to the filtering and sorting processes. 

4.4.4. Function 

The HTD serves as a repository for all TAQ values and for the treatment that 
each will be subjected to. 

4.5. HIGH FREQUENCY SURNAME TYPE DATA STORE DECOMPOSITION 
4.5.1. Identification 

4.5.1.1. This data store is known as the High Frequency Surname Type Data 
Store (HFST). 

4.5.1.2. This data store could be merged with the High Frequency Surname 
Variant Data Store (HFSV). 

4.5.1.2.1. The ID_NO would be different in the HFSV and would 
serve as a unique identifier for each entry. 
4.5.1.2.2. The set of HFSN_TYPEs, with no variants, would be 
derivable from the HFSN_VARs with a DI_VAL equal to 
1.00. 

4.5.1.2.3. The SET_ID of the HFST and HFSV would be the same. 

4.5.2. Type 

4.5.2.1. The HFST data store consists of the 500 most frequently occurring 
HF SN segment types (i.e., unique occurrence). 

4.5.2.2. The HFST will be accessed by the Hispanic Surname Segmenter, 
Hispanic Segment Positioner, and the Frequency Path Director modules. 

4.5.2.3. The format of the HFST will be 



DATA FIELD | DATA TYPE | FIELD SIZE | VALUE RANGE 
ID_NO      | integer   | 4          | 1...9999 
HFSN_TYPE  | character | 24         | alphabetics 
SET_ID     | integer   | 4          | 1...9999 


ID_NO | HFSN_TYPE | SET_ID 
0001  | GARCIA    | 0001 
0002  | RODRIGUEZ | 0002 
0003  | HERNANDEZ | 0003 
0004  | LOPEZ     | 0004 
0005  | MARTINEZ  | 0005 
0006  | GONZALEZ  | 0006 
0007  | PEREZ     | 0007 
0008  | SANCHEZ   | 0008 
0009  | RAMIREZ   | 0009 
0010  | GOMEZ     | 0010 
0011  |           | 0011 



4.5.2.4. Definitions 

4.5.2.4.1. ID_NO will be a unique numerical identifier for each of the 
HF SN segments, HFSN_TYPEs. 

4.5.2.4.2. HFSN_TYPE will contain a unique character string that 
represents one of the 500 most frequently occurring Hispanic 
surname stems. 

4.5.2.4.3. SET_ID will be the unique identifier for the set of variants 
of the HFSN_TYPE and will be used as the HFSN_KEY. 

4.6. HIGH FREQUENCY SURNAME VARIANT DATA STORE 
DECOMPOSITION 

4.6.1. Identification 

4.6.1.1. This data store is known as the High Frequency Surname Variant 
Data Store (HFSV). 

4.6.1.2. This data store will be updated in real time as variants qualify for 
inclusion in the data store. (See Section 3.12.4.38) 

4.6.1.3. This data store could be merged with the HFST (See Section 4.5.1.2.) 

4.6.2. Type 

4.6.2.1. The HFSV is a data store that consists of a HFSN_TYPE segment 
with a variant of that segment and a value that represents the degree of 
digraph proximity of the HFSN_TYPE and its variant. 

4.6.2.2. The HFSV is a data store that will have between 75,000 and 100,000 
rows. 

4.6.2.3. A name segment may be the variant of more than one HFSN_TYPE. 

4.6.2.4. The HFSV will be accessed by the High Frequency Processor and 
Low Frequency Processor. 

4.6.2.5. The format of the HFSV will be 



Figure 56: Format: High Frequency Surname Variant Data Store (HFSV) 

DATA FIELD | DATA TYPE | FIELD SIZE | VALUE RANGE 
ID_NO      | integer   | 6          | 000000...999999 
HFSN_VAR   | character | 24         | alphabetics 
SET_ID     | integer   | 4          | 0000...9999 
DI_VAL     | decimal   | 4          | 0.00...1.00 



Figure 57: Example: High Frequency Surname Variant Data Store 

ID_NO  | HFSN_VAR | SET_ID | DI_VAL 
032711 | PEREZ    | 0007   | 1.00 
032712 | PERES    | 0007   | 0.67 
032713 | PEREZA   | 0007   | 0.77 
016976 | GOMEZ    | 0010   | 1.00 
016977 | GOMES    | 0010   | 0.67 
016978 | BOMEZ    | 0010   | 0.67 



4.6.2.6. Definitions 

4.6.2.6.1. ID_NO will be a unique numerical identifier for each 
HFSN_VAR entry. 

4.6.2.6.2. HFSN_VAR will contain a character string that has been 
determined to be a variant of the HFSN_TYPE with which it 
is associated. A HFSN_VAR may be a variant of one or more 
of the HFSN_TYPEs. 

4.6.2.6.2.1. A variant is defined as a name stem that shares a 
sufficient number of digraphs (strings of two characters) 
with the HFSN_TYPE to pass a pre-determined 
threshold. 

4.6.2.6.2.2. A HFSN_TYPE can be obtained from the HFST or 
from the HFSV as a HFSN_VAR with a DI_VAL equal 
to 1.00. 

4.6.2.6.3. DI_VAL is a two-place decimal value that represents the 
proximity of the HFSN_TYPE and the HFSN_VAR. 

4.6.2.6.3.1. The DI_VAL is a calculation derived from the 
shared digraphs (strings of two characters) of the 
HFSN_TYPE and the HFSN_VAR associated with it. 

4.6.2.6.3.2. The calculation is determined in the following way: 

4.6.2.6.3.2.1. The digraphs are identified for each name 
stem, the HFSN_TYPE (Comparand #1) and the 
HFSN_VAR (Comparand #2). 

4.6.2.6.3.2.1.1. Each pair of alphabetic 
characters is identified: GOMEZ 
-> GO/OM/ME/EZ 

4.6.2.6.3.2.1.2. A digraph is also formed of 
the initial boundary (#) and the 
first alphabetic character: 
GOMEZ -> #G. 

4.6.2.6.3.2.1.3. A digraph is also formed of 
the final alphabetic character and 
the final boundary (#): GOMEZ 
-> Z#. 

4.6.2.6.3.2.2. The number of shared digraphs is 
calculated; a digraph may match one digraph 
only. 

4.6.2.6.3.2.3. The number of shared digraphs is 
multiplied by 2 and divided by the total number 
of digraphs in comparand #1 added to the total 
number of digraphs in comparand #2. 

4.6.2.6.3.2.4. The formula is: 
(2 * d) / (a + b), where d = the total number 
of shared digraphs; where a = the total 
number of digraphs in Comparand #1 and 
where b = the total number of digraphs in 
Comparand #2. 



Figure 58: Example: Digraph Evaluation of Two Comparands 

COMPARANDS              | DIGRAPHS 
COMPARAND #1: DOMINGUEZ | #D DO OM MI IN NG GU UE EZ Z# 
COMPARAND #2: DOMINQUES | #D DO OM MI IN NQ QU UE ES S# 
SHARED DIGRAPHS (d)     | #D DO OM MI IN UE = 6 
DI_VAL                  | (2 * d) / (a + b) = 12 / 20 = .60 

4.6.3. Purpose 

4.6.3.1. For HNA-E to be an effective retrieval system, it must be able to 
retrieve variants of query names. The impact on system performance can 
be dramatic, however, if traditional matching techniques are used to 
identify variant names. By assigning variants to the same set and 
recording their digraph value, querying a HF surname will result in the 
direct retrieval of variant records and their digraph values. 

4.6.3.2. The HFSV also serves as a resource for identifying which HF 
surnames are related to a LF surname. 

4.6.4. Function 

The HFSV Data Store will be dynamically updated. (See Section 3.12.4.38 for 
details.) 

4.7. HIGH FREQUENCY DECISION MATRIX DATA STORE 

4.7.1. Identification 

This data store is known as the High Frequency Decision Matrix (HDM). 

4.7.2. Type 

4.7.2.1. The HDM is a data store that will provide criteria for database record 
retrieval for query records with HF name segments. 

4.7.2.2. It will be accessed by the High Frequency Processor (HFP). 



DATA FIELD               | DATA TYPE | FIELD SIZE | VALUE RANGE 
QUERY SN FORMAT          | character | 2          | A, B 
DATABASE SN FORMAT       | character | 2          | A, B, C 
YOB RANGE (YOB#)         | integer   | 1          | 0...6 
REFUSAL CODE LEVEL (RL#) | integer   | 1          | 0...4 
RECORD GENDER (RGNDR)    | character | 3          | M, F, U 
Figure 60: Example: Hispanic Decision Matrix (HDM) (Values for example only) 

                    | Single-Segment SN  | Two-Segment SN 
QUERY SN FORMAT     | A    | A    | A    | AB   | AB   | AB   | AB   | AB   | AB   | AB   | AB 
DATABASE SN FORMATS | A    | AB   | BA   | AB   | BA   | A    | B    | AC   | CA   | CB   | BC 
YOB#                | 5    | 5    | 2    | 5    | 4    | 4    | 2    | 2    | 0    | 0    | 0 
RL#                 | 4    | 4    | 3    | 4    | 4    | 4    | 1    | 1    | 0    | 0    | 0 
RGNDR               | MFU  | MFU  | MFU  | MFU  | MFU  | MFU  | MFU  | FU   | MFU  | MFU  | MFU 



4.7.3. Definitions 

4.7.3.1. QUERY SN FORMAT: is a character string that is an abstract 
representation of the query SN. Each segment is represented by a single 
character, the leftmost A, the next B. The sequence also represents the 
position of the segment. 

4.7.3.2. DATABASE SN FORMAT: is a character string that is an abstract 
representation of the possible and acceptable variations in the query SN 
which are relevant to the QUERY SN FORMAT and which will be 
retrieved from the database, given the conditions stipulated in the YOB 
RANGE (YOB#), REFUSAL CODE LEVEL (RL#) and RECORD 
GENDER (RGNDR). Each segment is represented by a single character. 

4.7.3.2.1. If the character is the same as a character in the QUERY SN 
FORMAT, it represents the same SN Key. 

4.7.3.2.2. If the character is different from a character in the QUERY 
SN FORMAT, it represents a different SN Key. 

4.7.3.2.3. If the character is in the same relative position as that in the 
query SN, it represents the same position in the SN string. 

4.7.3.2.4. If the character is not in the same relative position as that in 
the query SN, it represents a different (out-of) position in the SN 
string. 
4.7.3.3. YOB RANGE (YOB#): is an integer that represents a YOB range 
specified in the YOB RANGE (YR) Data Store. (N.B. In this scheme, the 
YOB# integer does not represent the year range itself. It refers to a table 
that specifies that YOB2, for example, represents an exact year-of-birth 
and that YOB3 represents a range of 1 year on either side of the query 
year (for a range total of 3 years).) 

4.7.3.4. REFUSAL CODE LEVEL (RL#): is an integer that represents a 
Refusal Code Level specified in the REFUSAL CODE LEVEL Data 
Store. (N.B. In this scheme, this number represents a set of Refusal 
Codes that has a pre-determined degree of seriousness. The number 
given here does not signal the Refusal Code itself. The number is 
expanded in the Refusal Code Level Data Store, where 0, for example, 
might represent a 00 Refusal Code.) 

4.7.3.5. RECORD GENDER (RGNDR): is a set of up to three characters 
that represent the required Record Gender of the database record. 

4.7.4. Purpose 

Many Hispanic surnames occur with very high frequency; they also generally 
have at least two segments. Any retrieval system that captures only one of 
these names will have an inordinately high recall. Many of these records will 
not be at all relevant to the query record. Special treatment of high frequency 
names must entail some method of reducing the number of irrelevant records 
retrieved from the database. The HDM provides the information about how to 
delimit the records that will be retrieved from the database. A reduction in the 
recall will reduce post-processing time. 

4.7.5. Function 

The HDM is a data store that consists of qualifying and delimiting criteria. 

4.7.5.1. Qualifying criteria will be the number of SN segments, SN content, 
and SN segment positions. 

4.7.5.2. Delimiting criteria will be Year-of-Birth (YOB) Range (YR), Refusal 
Code (RC) Level (RL) and Record Gender (RGNDR). 

4.7.5.2.1. The qualifying criteria will produce a set of SN formats to 
retrieve from the database. 

4.7.5.2.2. The delimiting criteria will specify the YOB range, 
maximum RC Level for each of the SN formats and Record 
Gender limitations, if any. 
4.8. HISPANIC GIVEN NAME TYPE DATA STORE DECOMPOSITION 
4.8.1. Identification 

4.8.1.1. This data store is known as the Hispanic Given Name Type Data 
Store (HGT). 

4.8.1.2. This data store could be merged with the High Frequency Given Name 
Variant Data Store (HFGV). 

4.8.1.2.1. The ID_NO would be different in the HFGV and would 
serve as a unique identifier for each entry. 

4.8.1.2.2. The set of HFGN_TYPEs, with no variants, would be 
derivable from the HFGN_VARs with a DI_VAL equal to 
1.00. 

4.8.1.2.3. The SET_ID of the HFGT and HFGV would be the same. 
4.8.2. Type 

4.8.2.1. The HGT data store will consist of up to ten thousand entries. 

4.8.2.2. The HGT will be accessed by the Hispanic Gender Identifier, the 
Frequency Path Director (FPD), and the High Frequency Processor. 

4.8.2.3. The HGT will have the following format: 



Figure 61: Format: Hispanic Given Name Type Data Store (HGT) 

DATA FIELD | DATA TYPE | FIELD SIZE | VALUE RANGE 
ID_NO      | integer   |            | 0000...9999 
GN_TYPE    | character | 24         | alphabetics 
SET_ID     | integer   |            | 001...999 
HI_FREQ    | integer   |            | 1, 0 (True, False) 
GNDR       | character |            | M, F, U 



ID_NO | GN_TYPE   | SET_ID | HI_FREQ | GNDR 
0001  | JOSE      | 0001   |         | M 
0002  | MARIA     | 0002   |         | F 
0003  | JUAN      | 0003   |         | M 
0004  | LUIS      | 0004   |         | M 
0005  | ANTONIO   | 0005   |         | M 
0006  | CARLOS    | 0006   |         | M 
0007  | JESUS     | 0007   |         | M 
0008  | MANUEL    | 0008   |         | M 
0009  | FRANCISCO | 0009   |         | M 
0010  | JORGE     | 0010   |         | M 
0011  |           | 0011   |         | 
2367  | DAGOBERTO | 0000   | 0       | M 
4.8.2.4. Definitions 

4.8.2.4.1. ID_NO: is an integer that is a unique numerical identifier 
for each of the GN_TYPEs. 

4.8.2.4.2. GN_TYPE: is a character string that represents one of up 
to ten thousand Hispanic given name stems. 

4.8.2.4.2.1. A HFGN_TYPE is a GN_TYPE whose HI_FREQ 
value is 1 (True). 

4.8.2.4.3. SET_ID: is an integer that is the numerical identifier for the 
set of related variants of the GN_TYPE that is HF. 

4.8.2.4.3.1. The SET_ID will serve as the HFGN_KEY. 

4.8.2.4.3.2. Not every entry in the HGT will have a unique 
SET_ID; a distinct SET_ID is reserved for those 
GN_TYPEs where HI_FREQ is True (1). 

4.8.2.4.4. HI_FREQ: is an integer (1, 0 / True, False) that indicates if 
the GN_TYPE is or is not a HF GN segment. 

4.8.2.4.4.1. The frequency of all GN_TYPEs will be specified. 

4.8.2.4.4.2. True (1) will indicate a HF segment. 

4.8.2.4.4.3. False (0) will indicate a LF segment. 

4.8.2.4.5. GNDR: is a single character value that indicates the gender 
of the GN_TYPE. 

4.8.2.4.5.1. If the name is predictably female, the value will be 
F. 

4.8.2.4.5.2. If the name is predictably male, the value will be M. 

4.8.2.4.5.3. If the name is ambiguous or unknown, the value will 
be U. 

4.8.3. Purpose 

The HGT provides information about Hispanic given name segments. It 
indicates the frequency of the segments, their gender and the set of names of 
which they are the parent. 

4.8.4. Function 

The HGT serves as a resource for the Hispanic Gender Identifier and the High
Frequency Processor.

4.9. HIGH FREQUENCY GIVEN NAME VARIANT DATA STORE
DECOMPOSITION

4.9.1. Identification 

This data store is known as the High Frequency Given Name Variant Data 
Store (HFGV). 

4.9.2. Type 

4.9.2.1. The HFGV will be accessed by the High Frequency Processor. 

4.9.2.2. The HFGV will have about 60,000 to 90,000 rows. 

4.9.2.3. The HFGV will have the following format:

DATA FIELD   DATA TYPE   FIELD SIZE   VALUE RANGE
----------   ---------   ----------   ---------------
ID_NO        integer     5            00000...99999
HFGN_VAR     character   24           alphabetics
SET_ID       integer     4            0000...9999
DI_VAL       decimal     4            0.00...1.00

Figure 64: Example: Piece of HFGV

ID_NO   HFGN_VAR   SET_ID   DI_VAL
-----   --------   ------   ------
00001   JOSE       0001     1.00
00002   JOSEA      0001     0.73
00003   JOSSE      0001     0.73
00004   MARIA      0002     1.00
00005   MIRIA      0002     0.67
00006   MIRIAM     0164     1.00
00007   MIRIA      0164     0.77

4.9.2.4. Definitions

4.9.2.4.1. ID_NO: is a unique numerical identifier for each
HFGN_VAR entry in the HFGV data store.

4.9.2.4.2. HFGN_VAR: is a character string that represents a GN
segment that is a digraph variant of the HFGN_TYPE
(the HFGN_VAR whose DI_VAL = 1.00).

4.9.2.4.3. SET_ID: is a unique identifier of the set of GN segments
that are variants of the same HFGN_TYPE.

4.9.2.4.4. DI_VAL: is a two-place decimal value that indicates the
digraph relationship between the HFGN_VAR and its parent
HFGN_TYPE.

4.9.3. Purpose

The HFGV is a resource for defining the given name segments that will be
stored with records added to the database. Storage of information about
variant relations will speed retrieval and the filtering process.

4.9.4. Function

The HFGV will be accessed by the HFP to assign keys to given name
segments on record add and query.
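
The key assignment described in 4.9.4 can be illustrated with a minimal
sketch. It assumes the HFGV is loaded into memory and indexed by HFGN_VAR,
and that a DI_VAL threshold (for example, the HFGV THRESHOLD parameter) may
be applied when keys are assigned; neither assumption is stated in this
specification. The function names are hypothetical, and the rows are copied
from the example piece of the HFGV.

```python
from collections import defaultdict

# Rows from the example piece of the HFGV: (ID_NO, HFGN_VAR, SET_ID, DI_VAL).
HFGV_ROWS = [
    ("00001", "JOSE", "0001", 1.00),
    ("00002", "JOSEA", "0001", 0.73),
    ("00003", "JOSSE", "0001", 0.73),
    ("00004", "MARIA", "0002", 1.00),
    ("00005", "MIRIA", "0002", 0.67),
    ("00006", "MIRIAM", "0164", 1.00),
    ("00007", "MIRIA", "0164", 0.77),
]


def build_index(rows):
    """Index the HFGV by variant; a variant may belong to several sets."""
    index = defaultdict(list)
    for _id_no, variant, set_id, di_val in rows:
        index[variant].append((set_id, di_val))
    return dict(index)


def keys_for_segment(index, segment, threshold=0.0):
    """Return the SET_IDs (keys) of every set the segment is a variant of.

    The threshold is a hypothetical cut-off on DI_VAL, not a rule taken
    from the specification.
    """
    matches = index.get(segment.upper(), [])
    return [set_id for set_id, di_val in matches if di_val >= threshold]


INDEX = build_index(HFGV_ROWS)
```

In this sketch, MIRIA receives two keys because it is listed as a variant of
both MARIA (set 0002) and MIRIAM (set 0164).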

4.10. LOW FREQUENCY SURNAME TYPE DATA STORE DECOMPOSITION 

4.10.1. Identification

This data store is known as the Low Frequency Surname Type Data Store
(LFST).

4.10.2. Type

4.10.2.1. The LFST is a data store of LF keys.

4.10.2.2. The LFST will have about 900,000 to 1 million rows.

4.10.2.3. The LFST will have the following format:



Figure 65: Format: Low Frequency Surname Type Data Store (LFST)

DATA FIELD   DATA TYPE   FIELD SIZE   VALUE RANGE
----------   ---------   ----------   ------------------------------
ID_NO        integer                  000001...999999
LFSN_TYPE    character   24           alphabetics
LFDIKEY      character                alphanumerics; char + char + #

ID_NO    LFSN_TYPE   LFDIKEY
------   ---------   -------
000001   AALVAREZ    AA1
000001   AALVAREZ    AA2
000001   AALVAREZ    AL2
000001   AALVAREZ    AL1
000001   AALVAREZ    AL3
000001   AALVAREZ    LV3
000001   AALVAREZ    LV2
000001   AALVAREZ    LV4
000001   AALVAREZ    VA4
000001   AALVAREZ    VA3
000098   BARRIOS     BA1
000098   BARRIOS     BA2









4.10.2.4. Definitions: 

4.10.2.4.1. ID_NO: is an arbitrary numerical reference to each
LFSN_TYPE. The ID_NO will serve as the DI_KEY.

4.10.2.4.2. LFSN_TYPE: is the unique low frequency name segment
as it occurs in the database; if there are multiple occurrences
of the same name, they are represented by one entry, hence the
term "type."

4.10.2.4.3. LFDIKEY: is a string of alphanumeric characters that
represents one digraph and its actual or derived position.

4.10.2.4.3.1. Up to ten LFDIKEYs will be associated with each
LFSN_TYPE.

4.10.2.4.3.2. An LFDIKEY is name-specific, so the same key
may appear with other LFSN_TYPEs, in which case it
will have a different ID_NO.

4.10.2.4.3.3. A LFDIKEY is

1) a digraph formed from the LF SN segment beginning
with the leftmost character and its position (Base
Key) and

2) a positional variant on that digraph key (Position
Key).

4.10.2.4.3.4. Positional information will be associated with each
digraph.

4.10.2.4.3.5. To form a key, begin with the leftmost character
and generate four digraph keys (Base Key) from the five
leftmost characters of the LF SN segment. The first two
characters form a digraph, the second and third
characters form a digraph, the third and fourth characters
form a digraph and the fourth and fifth characters form a
digraph. Positional information (Positions 1, 2, 3, 4)
will be included.

4.10.2.4.3.6. Generate, from the Base Keys, up to six additional
Position Keys; the position keys have the same
characters as the Base Keys but contain different
positional information. A maximum of ten keys (Base +
Position) will be generated.

4.10.2.4.3.6.1. Produce a Position Key on the first Base
Key with Position 2.

4.10.2.4.3.6.2. Produce Position Keys on the second
Base Key with Position 1 and Position 3.

4.10.2.4.3.6.3. Produce Position Keys on the third Base
Key with Position 2 and Position 4.

4.10.2.4.3.6.4. Produce a Position Key on the fourth
Base Key with Position 3. No Position Key is
generated for Position 5 because the maximum of
10 keys has been reached.

4.10.3. Purpose

The LFST provides information that will limit the search of database records. 
Preprocessing of name types allows identification of relevant name segments 
without having to examine database records directly. 

4.10.4. Function 

The LFST will be accessed by the LFP. 
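
The key-formation steps in 4.10.2.4.3.5 and 4.10.2.4.3.6 can be sketched as
follows. This is an illustration under stated assumptions rather than the
specified implementation: the function name is hypothetical, and the
handling of segments shorter than five characters (fewer digraphs, hence
fewer keys) is an assumption, since the specification does not address it.

```python
def lfdikeys(segment):
    """Generate up to ten LFDIKEYs for a low frequency surname segment.

    Base Keys are the digraphs taken from the five leftmost characters,
    each tagged with its actual position (1-4). Position Keys repeat a
    digraph with the neighbouring positions, limited to positions 1-4.
    """
    head = segment.upper()[:5]
    digraphs = [head[i:i + 2] for i in range(len(head) - 1)]

    keys = []
    for position, digraph in enumerate(digraphs, start=1):
        keys.append(f"{digraph}{position}")          # Base Key
        for shifted in (position - 1, position + 1):  # Position Keys
            if 1 <= shifted <= 4:
                keys.append(f"{digraph}{shifted}")
    return keys
```

For AALVAREZ this yields the same ten keys, in the same order, as the
example piece of the LFST above: four Base Keys plus six Position Keys.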

4.11. HISPANIC CHARACTER DATA STORE

4.11.1. Identification

This data store is known as the Hispanic Character Data Store (HCD).

4.11.2. Type

4.11.2.1. The HCD is a data store of all characters in Hispanic names and
their predictable variants.

4.11.2.2. The format of the HCD will be:

DATA FIELD   DATA TYPE   FIELD SIZE   VALUE RANGE
----------   ---------   ----------   -----------
SET_ID       integer     3            000...999
CHAR         character   1            alphabetics
CHAR_VAR     character   1            alphabetics

Figure 68: Example: Piece of HCD

SET_ID   CHAR   CHAR_VAR
------   ----   --------
001      B      B
001      B      V
002      S      S
002      S      Z
004      C      C
004      C      S
...
037      F      F
052      K      K
078      M      M
078      M      N









4.11.2.3. Definitions 

4.11.2.3.1. SET_ID: is an arbitrary numerical value that represents the set
of characters that vary with one another. The SET_ID will be
the GN_INIT Key.

4.11.2.3.2. CHAR: is a single alphabetic character. Every alphabetic
character will be represented. The CHAR is the type of
character, which may or may not have variants
(CHAR_VAR).

4.11.2.3.3. CHAR_VAR: is a single alphabetic character that may or
may not vary predictably with other characters in written
Spanish. A single character may participate in more than one
set.

4.11.3. Purpose 

Retrieval of records with HF SN segments from the database will be limited 
by the initial of the GN segments. For the retrieval to be sufficiently robust, 
however, the system must allow for some variation in the GN initials. The 
HCD indicates variations on initials. 

4.11.4. Function 

The HCD will be accessed by the HFP and will provide the source of the
GN_INIT Keys that are to be generated for HF searches.
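
One possible reading of how GN_INIT Keys could be derived from the HCD is
sketched below. The rule used here, that an initial receives the SET_ID of
every set in which it appears as a CHAR_VAR, is an assumption; the
specification states only that the SET_ID serves as the GN_INIT Key and
that a character may participate in more than one set. The function name is
hypothetical, and the rows are copied from the example piece of the HCD.

```python
# Rows from the example piece of the HCD: (SET_ID, CHAR, CHAR_VAR).
HCD_ROWS = [
    ("001", "B", "B"), ("001", "B", "V"),
    ("002", "S", "S"), ("002", "S", "Z"),
    ("004", "C", "C"), ("004", "C", "S"),
    ("037", "F", "F"),
    ("052", "K", "K"),
    ("078", "M", "M"), ("078", "M", "N"),
]


def gn_init_keys(initial, rows=HCD_ROWS):
    """Return the SET_IDs of all sets that list the initial as a CHAR_VAR."""
    initial = initial.upper()
    keys = []
    for set_id, _char, char_var in rows:
        if char_var == initial and set_id not in keys:
            keys.append(set_id)
    return keys
```

Under this reading, the initial S yields two keys, because S is both the
type of set 002 and a variant of C in set 004.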

4.12. TAQ FILTER DATA STORE DECOMPOSITION 

4.12.1. Identification 

This data store is known as the TAQ Filter Data Store (TF). 

4.12.2. Type 

4.12.2.1. This TF will be accessed by the Hispanic Filter and Sorter and 
provides parameter factors for matching TAQ DISREGARD lags during 
record filtering. 

4.12.2.2. The format of the TF follows: 



Figure 69: Format: TAQ Filter Matrix Design

DATA FIELD   DATA TYPE   FIELD SIZE   VALUE RANGE   DATA VALUE
----------   ---------   ----------   -----------   ------------------
TAQDIS#1     character   8            alphabetics   TAQ_DISREGARD ITEM
TAQDIS#2     character   8            alphabetics   TAQ_DISREGARD ITEM
TF_VALUE     decimal     4            0.00...1.00   Various (TBD)

TAQDIS#1   TAQDIS#2   TF_VALUE
--------   --------   --------
DE         DE         1.00
DE         DEL        0.90
DE         DE LOS     0.90
DE         LOS        0.75
DE         SAN        0.75
DE         LA         0.75
DEL        DEL        1.00
DEL        DE LOS     0.75
DEL        LOS        0.65
DEL        LA         0.85
DEL        SAN        0.50
DE LOS     DE LOS     1.00
DE LOS     LOS        0.90
DE LOS     SAN        0.50
DE LOS     LA         0.50
SAN        SAN        1.00
SAN        LOS        0.50
SAN        LA         0.50
LOS        LOS        1.00
LOS        LA         0.85
LA         LA         1.00









4.12.2.3. Definitions 

4.12.2.3.1. TAQDIS#1: is the TAQ DISREGARD segment that
occurs in one of the comparands.

4.12.2.3.2. TAQDIS#2: is the TAQ DISREGARD segment that
occurs in the other comparand.

4.12.2.3.3. TF_VALUE: is the factor that will be used to adjust the
SN_VAL or GN_VAL if the TAQDIS#1 and TAQDIS#2 are
present in the comparands.

4.12.3. Purpose 

Hispanic names often have peripheral name elements. Some of these make up
a segment of the name; these are the TAQ values identified in the TF. Their
relative value, however, varies. Some of them cannot co-occur and some have
opposite meanings, so it is necessary to identify their relative value when
they are contrasted with one another.

4.12.4. Function 

The TF provides the resources for the HFS to determine the relative value of
TAQs that occur in two comparands.
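
The lookup described above can be sketched as follows. Two points are
assumptions and not taken from the specification: that the table is
symmetric (the order of the two TAQs does not matter), and that "adjust"
means multiplying the name score by the TF_VALUE. The function names are
hypothetical; the rows are a subset of the example TF table.

```python
# A few rows from the example TF table: (TAQDIS#1, TAQDIS#2) -> TF_VALUE.
TF_TABLE = {
    ("DE", "DE"): 1.00,
    ("DE", "DEL"): 0.90,
    ("DE", "DE LOS"): 0.90,
    ("DE", "LA"): 0.75,
    ("DEL", "LA"): 0.85,
    ("SAN", "LA"): 0.50,
    ("LOS", "LA"): 0.85,
}


def tf_value(taq1, taq2, table=TF_TABLE):
    """Look up the factor for a TAQ pair in either order; None if absent."""
    a, b = taq1.upper(), taq2.upper()
    if (a, b) in table:
        return table[(a, b)]
    return table.get((b, a))


def adjust_score(name_val, taq1, taq2, table=TF_TABLE):
    """Scale an SN_VAL or GN_VAL by the TF_VALUE (assumed multiplicative)."""
    factor = tf_value(taq1, taq2, table)
    if factor is None:
        return name_val  # no entry: leave the score unchanged (assumption)
    return round(name_val * factor, 4)
```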

4.13. HISPANIC PARAMETER DATA STORE DECOMPOSITION

4.13.1. Identification

This module is known as the Hispanic Parameter Data Store (HPD).

4.13.2. Type

4.13.2.1. The HPD is a data store that will be accessed by the Filter
Component of the Hispanic Filter and Sorter (HFS).

4.13.2.2. The HPD is a parameter table that will be accessible to the user and 
whose cell values will be determined through testing and comparative 
evaluation. 

4.13.2.3. The HPD has the following format: 



Figure 71: Format: Hispanic Parameter Data Store (HPD)

DATA FIELD   DATA TYPE   FIELD SIZE   VALUE RANGE   DATA VALUE
----------   ---------   ----------   -----------   ----------------------------
PARM_NAME    character   6            alphabetics   SNTHR, GNTHR, OPSVAL,
                                                    OPGVAL, INITSN, INITGN,
                                                    RGNDR, TAQASN, TAQAGN,
                                                    TAQXSN, TAQXGN, RL#, YOB#,
                                                    COB#, etc.
PARM_VAL     decimal     4            0.00...1.99   Various (TBD)

PARM_NAME           PARM_VAL
-----------------   --------
SNTHR               0.60
GNTHR               0.65
LFDIKEY THRESHOLD   0.57
DI_VAL THRESHOLD    0.63
HFGV THRESHOLD      0.65
HFSV THRESHOLD      0.65
OPGVAL              0.60
OPSVAL              0.60
ASVAL               0.65
AGVAL               0.65
INITSN              0.85
INITGN              0.85
INITNM              0.80
RGNDR               0.65
TAQASN              0.90
TAQAGN              0.90
TAQXSN              0.85
TAQXGN              0.85
RL0                 1.20
RL1                 1.15
RL2                 1.10
RL3                 1.05
RL4                 1.00
YOB0                1.30
YOB1                1.25
YOB2                1.20
YOB3                1.15
YOB4                1.10
YOB5                1.05
YOB6                1.00
COB1                1.20
COB2                1.15
COB3                1.10
COB4                1.00
COB5                0.95



4.13.2.4. The values provided are for example only and do not necessarily
represent the PARM_VALs to be used for the parameters.

4.13.3. Purpose

The HPD is a data store that allows easy access to adjustable thresholds for
record qualification, to thresholds for data store updates, and to parameters
that contribute to the determination of the name scores (SN_VAL, GN_VAL)
and to the Composite Score of two record comparands.

4.13.4. Function

The HPD functions as an independent data store with thresholds needed by the
LFP and all the parameters needed by the HFS during the filtering process.

4.14. REFUSAL CODE CATEGORY DATA STORE DECOMPOSITION

4.14.1. Identification

This data store is known as the Refusal Code Level Data Store (RCL).

4.14.2. Type

4.14.2.1. It is recommended that the RCL be a parameter file, which can be 
accessed by the client so RC categories can be added to or changed with 
ease. 

4.14.2.2. The RCL data store will provide a list of the Refusal Codes and
the Refusal Category of each, which is an indication of the level of
seriousness of each Refusal Code.

4.14.2.3. The RCL will be referred to by the Hispanic Decision Matrix 
(HDM) and by the LFP and Hispanic Filter and Sorter. 

4.14.2.4. The RCL has the following format: 



Figure 73: Format: Refusal Code Level Data Store (RCL)

DATA FIELD     DATA TYPE       FIELD SIZE   DATA VALUE
------------   -------------   ----------   -----------------------
REFUSAL CODE   alphanumerics   3            Standard Refusal Codes
REF_CAT        alphanumerics   3            RL0, RL1, RL2, RL3, RL4

Figure 74: Example: RCL (REF_CATs for example only)

REFUSAL CODE   REF_CAT
------------   -------
00             RL0
23             RL1
6C             RL2
07             RL3
0              RL4



4.14.2.5. Definitions

4.14.2.5.1. REFUSAL CODE: indicates each Visa Refusal Code
(Codes and their Refusal Level (see VALUE) are for example
only; they do not represent the complete list nor the accurate
assignment of a Refusal Code to a Refusal Level).

4.14.2.5.2. REF_CAT: The RL# will appear in the form RL1, RL2,
etc.

4.14.2.5.2.1. RL# is the Refusal Category to which a particular
Refusal Code has been assigned. The Visa Office will
assign Refusal Codes to one of 4 categories: RL1, RL2,
RL3, RL4; RL0 is reserved for the Refusal Code 00.
(The current distinction among Refusal Codes is a binary
one: serious and non-serious. Assignment of Refusal
Codes to more groups has not yet been done; the
consequence is that one or more of these categories may
not currently have a distinct value.) The RL# occurs in
ascending order, from most serious to least serious
Refusal Code. The RL# will be linked to a Year-of-Birth
Code (see Section 4.16) to determine the relevant
subsets of records to be searched.

4.14.2.5.2.2. RC0 refers to the Refusal Code 00.

4.14.2.5.2.3. RC1 refers to all Refusal Codes that have been
designated as Type 1 Serious RC, i.e., the most
serious, excluding 00.

4.14.2.5.2.4. RC2 refers to all Refusal Codes that have been
designated as Type 2 Serious RC, i.e., serious but less
serious than RC0 and RC1.

4.14.2.5.2.5. RC3 refers to all Refusal Codes that have been
designated as Type 1 Non-Serious RC. These codes are
less serious than the RC0, RC1 and RC2 codes.

4.14.2.5.2.6. RC4 refers to Refusal Codes that have been
designated as Type 2 Non-Serious. These codes are the
least serious codes, less serious than the RC0, RC1, RC2
and RC3 codes.

4.14.3. Purpose 

It has long been desirable to make more granular distinctions among the 
Refusal Codes. For many years, DOS has maintained a distinction between 
serious and non-serious codes; these different categories were correlated with 
different YOB search ranges. However, a mechanism for making greater 
distinctions will provide greater flexibility in delimiting the set to be 
retrieved during the first stage of record analysis, especially for Hispanic high 
frequency names, where more restricted retrievals are highly desirable. The 
introduction of five refusal code levels also provides the opportunity to 
correlate more year-of-birth ranges to the refusal code levels. 

4.14.4. Function 

The RCL provides information needed for the evaluation of record proximity
in the Hispanic filtering process and contributes to the delimitation of
database records retrieved through the RLYOB Data Store.

4.15. YEAR-OF-BIRTH RANGE DATA STORE DECOMPOSITION

4.15.1. Identification

This data store is known as the Year-of-Birth Range Data Store (YR).

4.15.2. Type

4.15.2.1. It is recommended that the YR be a parameter file, which can be
accessed by the client so YOB ranges can be set. Alternatively, it could
be represented as a system parameter whose value(s) are set in an .ini
file.

4.15.2.2. The YR will define the YOB ranges that will be associated with a
Refusal Level (see Section 4.16).

4.15.2.3. This data store has the following format:



Figure 75: Format: Year-of-Birth Range Data Store (YR)

DATA FIELD   DATA TYPE   FIELD SIZE   VALUE    DATA DEFINITION
----------   ---------   ----------   ------   ----------------------------------
YOB0         integer     1            0        exact date of birth
YOB1         character   1            A        exact year, inverted month and day
YOB2         character   1            B        exact year of birth
YOB3         integer     2            1...99   narrow year of birth range
YOB4         integer     2            1...99   standard year of birth range
YOB5         integer     2            1...99   wide year of birth range
YOB6         integer     2            1...99   unlimited year of birth range



4.15.2.4. Definitions 

4.15.2.4.1. YOB# is the Year-of-Birth Range category whose value
indicates the year-of-birth range to be searched. The year-of-birth
VALUE indicates the search range, that is, the number of
years on either side of a given year-of-birth to be searched.
For example, if the input year is 1962 and the YOB3 range is 4,
the search will cover a range of nine years, 1958-1966. The
range includes the full year, so all of 1958 and all of 1966.

4.15.2.4.1.1. There are seven YOB# categories: YOB0, YOB1,
YOB2, YOB3, YOB4, YOB5, YOB6.

• YOB0 is a single integer that refers to an exact
month, day, year of birth. If YOB0 is specified, the
system must be able to match the month, day and
year of the Date of Birth of an input record and a
database record.

• YOB1 is a single character (A) that refers to an
exact year-of-birth with the month and day inverted.

    • If YOB1 is specified, the system must be able
    to match the year of Date of Birth and an
    inverted month and day (DEC 03 -> MAR 12)
    of the input record and the database record.

    • YOB1 may be relevant to the Hispanic Filter
    and Sorter, but may not function as a search
    parameter since the value would be subsumed
    in YOB2.

• YOB2 is a single character (B) that refers to an
exact year-of-birth. If YOB2 is specified, the
system must be able to match the year of the Date-of-Birth
of an input record and a database record.

• YOB3 is a one- or two-place integer (1...99) that
refers to a narrow year-of-birth range. Narrow
year-of-birth range is usually defined as 1 year (for a
search range of 3 years).

• YOB4 is a one- or two-place integer (1...99) that
refers to a standard year-of-birth range. Standard
year-of-birth range is usually defined as 3 years (for
a search range of 7 years).

• YOB5 is a one- or two-place integer (1...99) that
refers to a wide year-of-birth range. Wide year-of-birth
range is usually defined as 5 years (for a search
range of 11 years).

• YOB6 is a one- or two-place integer (1...99) that
refers to an unlimited or extremely wide year-of-birth
range. Unlimited year-of-birth range would be
set sufficiently high to include all (or all desired)
years-of-birth in the database (e.g., 50).
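
The range arithmetic in 4.15.2.4.1 can be sketched as follows. The function
names are hypothetical; the calculation follows the definition above, in
which the VALUE is the number of years searched on either side of the input
year and both end years are included in full.

```python
def yob_search_window(input_year, yob_range):
    """Return the first and last year searched for a given range VALUE."""
    if yob_range < 0:
        raise ValueError("yob_range must not be negative")
    return input_year - yob_range, input_year + yob_range


def window_length(yob_range):
    """Number of calendar years covered: the input year plus both sides."""
    return 2 * yob_range + 1
```

With an input year of 1962 and a range of 4, the window is 1958 through
1966, which covers nine years. The usual narrow, standard and wide values
(1, 3 and 5) give search ranges of 3, 7 and 11 years respectively.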

4.15.3. Purpose 

This YR provides a greater granularity in the year-of-birth range and, 
therefore, greater flexibility in delimiting the set to be retrieved during the first 
stage of record analysis. The correlation of five refusal code levels to different 
year-of-birth ranges will help to delimit the number of records to be searched 
and to define the more valuable set of records. For the Hispanic processor, 
greater precision in the year-of-birth range is especially important in the High 
Frequency Processor where more restricted retrievals are highly desirable. 

4.15.4. Function 

4.15.4.1. The YR permits greater granularity in the Date-of-Birth types related
to the system.

4.15.4.2. The YR will be accessed by the Refusal Code Level/YOB Range
Data Store, which will limit the retrieval range in the Hispanic Search
Engine.

4.15.4.3. The YR data store will define the YOB ranges referred to in the
Hispanic Decision Matrix (HDM).

4.15.4.4. The YR will supply the Hispanic Filter and Sorter with
information that contributes to the composite score.

4.16. REFUSAL CODE LEVEL / YOB RANGE DATA STORE MODULE
DECOMPOSITION

4.16.1. Identification

This data store is known as the Refusal Code Level/YOB Range Data Store
(RLYOB).

4.16.2. Type 

4.16.2.1. The RLYOB is a matrix that merges the values in the Refusal Code 
Level (RCL) Data Store and the Year-of-Birth Range (YR) Data Store. 

4.16.2.2. For each Refusal Level (RL), a Year-of-Birth (YOB) Range is
specified.

4.16.2.2.1. Only one YOB Range for each RL is permitted.

4.16.2.2.2. The same YOB Range may apply to more than one RL.

4.16.2.3. The RLYOB has the following format: 



Figure 76: Format: Refusal Level/Year-of-Birth Range Data Store (RLYOB)

DATA FIELD   DATA TYPE   FIELD SIZE   VALUE RANGE   DATA VALUE
----------   ---------   ----------   -----------   ----------------------------------------
RL#          character   3            RL0...4       RL0, RL1, RL2, RL3, RL4
YOB#         character   4            YOB0...6      YOB0, YOB1, YOB2, YOB3, YOB4, YOB5, YOB6

Figure 77: Example: RLYOB Data Store

RL#   YOB#
---   ----
RL0   YOB5
RL1   YOB4
RL2   YOB3
RL3   YOB3
RL4   YOB2



4.16.2.4. Definitions: 

4.16.2.5. RL#: is a character string that indicates the Refusal Level of the
Refusal Code.

4.16.2.6. YOB#: is a character string that indicates the Date-of-Birth Range 
Category of the comparands. 

4.16.3. Purpose 

Retrieval of records from the database should be delimited by a relationship 
between the Refusal Code Level and the Year-of-Birth Range. It will restrict 
the number of records to be reviewed. 
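
The relationship between Refusal Code, Refusal Level and Year-of-Birth
Range can be sketched as a two-step lookup. The mappings are copied from
the example RCL and RLYOB tables, which are themselves for example only;
the last example row of the RCL is omitted here. The function name is
hypothetical.

```python
# Example RCL rows: Refusal Code -> Refusal Level (for example only).
RCL = {"00": "RL0", "23": "RL1", "6C": "RL2", "07": "RL3"}

# Example RLYOB rows: exactly one YOB Range category per Refusal Level.
RLYOB = {
    "RL0": "YOB5",
    "RL1": "YOB4",
    "RL2": "YOB3",
    "RL3": "YOB3",  # the same YOB Range may apply to more than one RL
    "RL4": "YOB2",
}


def yob_category_for(refusal_code):
    """Resolve a Refusal Code to the YOB Range category used for retrieval."""
    refusal_level = RCL[refusal_code]
    return RLYOB[refusal_level]
```

Because the RLYOB is keyed by Refusal Level, the rule that only one YOB
Range is permitted for each RL holds by construction in this sketch.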

4.16.4. Function 

The RLYOB is a resource for the Hispanic Search Engine to delimit the LF
records retrieved from the database.

4.17. COUNTRY-OF-BIRTH PROXIMITY DATA STORE

4.17.1. Identification

This module is known as the Country-of-Birth Proximity Data Store
(COBPROX).

4.17.2. Type

4.17.2.1. The COBPROX is a data store whose cells contain a decimal value
that reflects the degree of relationship between the countries represented
by two country-of-birth codes.

4.17.2.2. The COBPROX has the following format:



Figure 78: Design: COBPROX Data Store

DATA FIELD   DATA TYPE   FIELD SIZE   VALUE RANGE   DATA VALUE
----------   ---------   ----------   -----------   ----------
COB#1        character   4            alphabetics   COB Code
COB#2        character   4            alphabetics   COB Code
COBVAL       decimal     4            0.00...1.00   Various

Figure 79: Example: Piece of COBPROX Data Store (COBVAL for example only)

COB#1   COB#2   COBVAL
-----   -----   ------
AGS     AGS     1.00
AGS     GRBR    0.05
AGS     VTNM    0.05
AGS     MORO    0.05
AGS     SYR     0.05
ALG     ALG     1.00
ALG     MORO    0.85
ALG     GRBR    0.05
ALG     VTNM    0.05
MORO    MORO    1.00
MORO    GRBR    0.05
MORO    VTNM    0.05
GRBR    GRBR    1.00
GRBR    VTNM    0.05
VTNM    VTNM    1.00

4.17.2.3. Definitions:

4.17.2.3.1. COB#1: is the 4-character Country-of-Birth Code of one
of the comparands.

4.17.2.3.2. COB#2: is the 4-character Country-of-Birth Code of one
of the comparands.

4.17.2.3.3. COBVAL: is the decimal value assigned through the
HCOB and other COB Category Data Stores that are
culture-specific (as they are developed). A default value will be
assigned for those COBs that do not enter into special
relations. The COBVAL indicates the degree of relationship
between the two COBs.

4.17.3. Purpose

The COBPROX Data Store provides information on the relative value of the
COBs in two comparands. This value can serve to limit the COBs that are
accessed for retrieval.

4.17.4. Function

The COBPROX is populated by the HCOB and any other partition-specific 
Country-of-Birth Category Data Stores. The COBPROX provides COB 
relationship information. 
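
A COBPROX lookup can be sketched as below. Three points are assumptions:
the lookup is treated as symmetric, identical codes that are not listed
return 1.00 (following the Exact category of the HCOB), and the default for
unrelated COBs is set to 0.05 to match the example rows. The function name
and the default constant are hypothetical; the rows are a subset of the
example piece of the COBPROX.

```python
# A few rows from the example COBPROX: (COB#1, COB#2) -> COBVAL.
COBPROX = {
    ("AGS", "AGS"): 1.00,
    ("AGS", "GRBR"): 0.05,
    ("ALG", "ALG"): 1.00,
    ("ALG", "MORO"): 0.85,
    ("MORO", "GRBR"): 0.05,
}

DEFAULT_COBVAL = 0.05  # hypothetical default for COBs with no special relation


def cobval(cob1, cob2, table=COBPROX, default=DEFAULT_COBVAL):
    """Return the degree of relationship between two COB codes."""
    a, b = cob1.upper(), cob2.upper()
    if (a, b) in table:
        return table[(a, b)]
    if (b, a) in table:
        return table[(b, a)]
    return 1.00 if a == b else default
```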

4.18. HISPANIC COUNTRY-OF-BIRTH CATEGORY DATA STORE
DECOMPOSITION

4.18.1. Identification

This data store is known as the Hispanic Country-of-Birth Category Data
Store (HCOB).

4.18.2. Type

The HCOB is a data store that will serve as the source of information for the
COBPROX Data Store, populating the COBVAL, and will provide the COB
Category (COBCAT) necessary for the Hispanic Filter and Sorter.

Figure 80: Design: Hispanic Country-of-Birth Category Data Store (HCOB)

Figure 81: Example: Piece of HCOB (Values for example only.)

4.18.3. Definitions

4.18.3.1. COB#1: is the 4-character COB Code of one of the comparands.

4.18.3.2. COB#2: is the 4-character COB Code of one of the comparands.

4.18.3.3. COBCAT: is the category assigned to the relationship of two 
COBs. 

4.18.3.3.1. Categories might be defined as Exact, State, Geographic
Region, Dialect Region.

4.18.3.3.2. All relationships are adjustable.

4.18.3.3.3. Example COB Categories are:

• COB1: Exact represents an exact match of the COBs:
AGS/AGS; the COBPROXVAL would be 1.00.

• COB2: State Relationship represents the set of COBs
that are states within one country (currently only the
Mexican States qualify). The score would be something
less than that applied to an exact match but nonetheless
high: 0.95.

• COB3: Northern South America represents the set of
COBs that are in close geographic proximity and share
naming conventions: COL/VENE. The value assigned
would be less than that for COB2: 0.85.

• COB4: All Latin America refers to all COBs in Central
and South America and the Spanish-speaking Caribbean.
The value assigned would be less than that for COB2:
0.65.

• COB5: Similar refers to COBs that have qualified as
Hispanic but may not exhibit Hispanic naming
conventions: Brazil, Portugal.

• COB6: All refers to all COBs and is assigned a value that
will allow the search of all COBs; it would be the lowest
decimal value used.

4.18.3.4. COBVAL: is the decimal value that will be assigned to a particular
COB relationship; this value will be used to determine the COBs that
will be permitted in the retrieval process.

4.18.4. Purpose 

Pre-defined COB category relationships will provide a definition of the values 
that appear in the COBPROX Data Store. 

4.18.5. Function 

These COB categories will provide information about COB relationships that
will contribute to determination of the Composite Score in the Arabic Filter
and Sorter.

Abstract

This paper describes a two-year research effort to incorporate phonological
[illegible] in automated name searching. Specifically, [illegible]
characters are automatically converted to multiple phonetic representations,
based on sets of regular expressions that relate character strings to
predictable sounds or sound sequences using a widely accepted phonetic
notation system, the International Phonetic Alphabet. Names are retrieved
when there is an intersection of the regular expressions of the query name
with regular expressions of names in a preprocessed database. Additional
similar names can be retrieved based on the articulatory characteristics of
the sound segments contained in the query and database names.

1.0 Introduction 

Variation in the spellings of names is a persistent issue in the area of
automated name searching in large databases (Hermansen, 1985). In general,
the source of spelling variation of names can be analyzed and explained
a posteriori. Predicting any [illegible] spelling, however, remains
problematic. Sources for spelling variation include: keyboard-based data
entry errors ([illegible]), syntactic variation (e.g., out-of-sequence given
name and surname such as Richard Thomas - Thomas Richard), morphological
variation (e.g., truncated strings such as [illegible] or R for Richard) and
semantically-based variation (e.g., nativizations such as Goldwater for
Goldwasser). Of interest in the current paper is variation due to
orthographic conventions (e.g., English can represent the same sound in more
than one way, as in Stephen - Steven) and articulatory variation (e.g., the
p in Thompson is a predictable spelling of Thomson based on principles of
articulation). While there are multiple sources of name variation, this
paper will present evidence 1) that the inherent ambiguity in the English
use of roman characters can be mitigated by multiple mappings to unambiguous
phonetic characters and 2) that phonologically-similar names can be
retrieved through the analysis of sounds into their articulatory features
(i.e., place and manner of articulation). It is based on research conducted
from September of 1995 through the present.

2.0 Statement of Problem

Character-based name searching relies on spelling as the basis for
calculating distance between the query name and the database name. While
spelling using roman characters is related to pronunciation, the
relationship between the two is often inconsistent (Cummings, 198?), and the
orthographic information (i.e., conventions of the spelling system of a
language) is at times misleading. Thus, one spelling may map to multiple
pronunciations: [illegible] can be pronounced to rhyme with [illegible],
[illegible] or [illegible], and at least [illegible] additional non-English
pronunciations are possible. The [illegible] is also the case: there may be
a number of ways of representing a single pronunciation. [illegible] and
[illegible], for example, are usually pronounced identically by English
speakers.

Character-matching techniques assume a reliable relationship between the
orthographic system and the pronunciation. This assumption is flawed because
the goodness of fit between orthography and pronunciation, especially for
English, is many-to-many; that is, a given roman character can stand for
more than one sound, and [illegible] represented in more than one way in the
spelling system. Thus, the sound [f] [illegible] written as f (Frank),
ff ([illegible]), ph (Philip) or even gh (Rough). Conversely, the digraph gh
may represent the [f] sound (Rough), be silent (Dough), or represent [k] (in
some pronunciations of McC[illegible]), [h] (in Monaghan), [g] (in McGhee)
or [gh] (across syllable breaks, as in Bighouse).
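
The many-to-many relationship described above can be illustrated with a
minimal sketch that expands a spelling into several candidate sound strings
and compares two names by intersecting their candidate sets. This is not the
system described in this paper: plain letters stand in for phonetic symbols,
the three rules cover only the spellings of [f] and the readings of gh
listed above, and all names in the code are hypothetical.

```python
import itertools
import re

# Spelling -> candidate sounds. The gh readings follow the list above:
# [f], silent, [k], [h], [g], and [gh] across a syllable break.
RULES = {
    "gh": ["f", "", "k", "h", "g", "gh"],
    "ph": ["f"],
    "ff": ["f"],
}
PATTERN = re.compile("|".join(sorted(RULES, key=len, reverse=True)))


def candidate_forms(name):
    """Expand a name into every sound string the rules allow."""
    name = name.lower()
    pieces, pos = [], 0
    for match in PATTERN.finditer(name):
        if match.start() > pos:
            pieces.append([name[pos:match.start()]])  # unchanged text
        pieces.append(RULES[match.group()])           # alternative sounds
        pos = match.end()
    if pos < len(name):
        pieces.append([name[pos:]])
    return {"".join(combo) for combo in itertools.product(*pieces)}


def may_match(query, candidate):
    """Two names are candidates for retrieval if their expansions intersect."""
    return bool(candidate_forms(query) & candidate_forms(candidate))
```

Here one spelling (Rough) expands to six candidates, while two different
spellings (Philip, Filip) expand to the same single candidate and therefore
intersect.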

While much name variation can be traced to non-phonological issues, including syntax
(order of name segments), aliases (John Doe for John Dillinger), morphological issues
(Peg for Margaret) or data entry errors, many name variants can be traced to the
relationship between orthography and pronunciation. Orally transmitted names, for
instance, are especially prone to guesses on the part of the transcriber as to the official
(i.e., legal) spelling of an individual's name. Language contact can account for some
spelling variants as well (French Beauchamp and Anglicized Beecham), as can
transcription from non-roman character sets (Wachmi and Ouakhmi; Xie, Hsieh and Sye)
and sound change over time (e.g., Leigh is now pronounced the same as Lee).

Additionally, regular (i.e., predictable) processes of speech produce variability in how a
name may be written. Thus, the presence of the letter p in Thompson is an artifact of poor
articulatory timing as the articulators move from a nasal [m] to an oral [s]. (The variant
spelling Thomson reflects a more etymologically justified spelling.)

3.0 Name Representation: Spelling 

LAS has been investigating the feasibility and utility of incorporating information about
the pronunciation of characters into the automated name searching process. The
researchers considered a number of options, including an acoustic level of representation
and character-based rules, and determined that searching of character-based databases
could be enhanced to include predictable language-based information about
character-to-sound mappings. Specifically, LAS recommended the use of the stock of phonetic
symbols known as the International Phonetic Alphabet (IPA), widely used by linguists to
represent the inventory of sounds used in the world's languages, and officially adopted by
the International Phonetic Association (Laver, 1994). The IPA uses a closed set of
symbols to transcribe speech in ways that are interpretable unambiguously by linguists,
regardless of the language being described. (See Appendix A.) For example, the symbol
[θ] (placed between brackets to indicate that it represents a sound rather than a letter)
always stands for a voiceless apico-dental fricative, as in English thigh, while [ð] always
stands for the equivalent voiced apico-dental fricative, as in English thy. Thus, IPA
disambiguates the English orthographic pattern of using th to stand for either sound: thigh
[θaj] versus thy [ðaj]. A name such as Gaither, of course, might be pronounced with either
of these sounds, and would thus have two IPA representations, one for each pronunciation:
[geθr] versus [geðr]. There is international agreement by members of the International
Phonetic Association, founded in 1886, as to the interpretation of IPA symbols. A
re-evaluation of the stock of symbols and special diacritic marks took place at the 1989 IPA
Convention in Kiel, and the efforts of the Association have resulted in the unambiguous
mapping of sounds onto IPA symbols that transcends individual speakers or languages
(Laver, ibid.).

¹ Square brackets indicate that a sound is being represented, rather than a spelling.

4.0 Mapping Spelling to Sound 

The issue of how to predict pronunciation of names from orthography is far from trivial. 
Two key considerations include that: 

• pronunciations of proper names are far less uniform than pronunciations of other 
vocabulary. The pronunciation of the noun dough is more-or-less fixed in English,
despite the fossilized spelling that can be traced to an earlier pronunciation. The
pronunciation of the name Lough is far less certain: individuals named Lough may
well vary in their pronunciation of the family name and, even if all families named
Lough could reach a consensus, there is no assurance that those unfamiliar with their
consensus would guess that pronunciation. Additionally, some names retain old
spellings that map to modern pronunciations in highly improbable ways (e.g., British
Cholmondeley is commonly pronounced the same as Chumley). Claims of "correct"
pronunciations carry little weight in terms of name searching;

and: 

• orthographies are language-specific. The pronunciation of the letter x regularly maps 
to [ks] and [z] in English (Alexander, Xenia), is regularly silent word-finally in French
orthography (LaCroix), stands for the velar fricative [x], or [s] in Spanish (Mexico,
Xochimilco), and a [dz] or [dʒ] in Albanian (Hoxha). Additionally, standardized
transcription systems from non-roman systems to roman exploit the letter x to stand for
other, non-English sounds (e.g., Chinese Xie, Greek [illegible]). Finally, any name may
be nativized to fit the "borrower" language: spellings of non-Anglo names may be
pronounced according to English orthographic conventions (e.g., French Duquesne
pronounced [dukwzni]).

5.0 Writing IPA Conversion Rules

IPA is an effective notational system for representing pronunciation. LAS has written sets
of rules that relate spellings to sounds. The rules are language-based, with sets of rules
operating for Arabic, Mandarin Chinese, Hispanic and Anglo names. The rules assume:

• 26-character sets of roman letters, absent all diacritic markings, including accent marks
or tone indicators;

• English speakers, either naive or expert in the language of origin;

• one spelling can map to multiple pronunciations.

The rule sets were written to specific development databases made of single name
elements, either surname or given, and taken from a variety of sources, including the U.S.
Census list of the most frequent names in the U.S. and large U.S. databases of names from
other countries. The names were manually tagged as "Arabic", "Mandarin Chinese",
"Hispanic" and "Anglo", where "Anglo" was loosely interpreted to include Western
European Germanic names (including Dutch and German). A team of linguists used a
variety of sources to determine possible pronunciations, including native speaker
knowledge and textual information (e.g., Cummings, 1988; Hanks and Hodges, 1989,
1990; Symonds, 1986). In general, rules were written broadly in order to ensure that most
plausible pronunciations were captured. The Arabic and Mandarin Chinese rules included
transcription variation (e.g., Chinese pinyin, Wade-Giles and Yale conventions of
rendering Chinese names into roman script, as in Xie/Hsieh/Sye). The sample Anglo rule
below is interpreted to mean that the letters sc preceded by anything and followed by the
letters le can be pronounced as [s] or [sk] (e.g., Muscle and Mosclin):

sc / anything __ le  ->  [sk?]

Rules were implemented using standard regular expression notation. The following table
shows a sample query and the names returned from a data file containing the 88,799 most
frequent surnames from the U.S. census:



Search on SMITH

SMITH
SMYTH
SMITHE
SMIT
SMYTHE
SMIDT
SMIHT
SZMIDT

Figure 1 Search on name SMITH
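
As a rough illustration of how the sample Anglo rule could be written in regular expression notation, the Python sketch below rewrites the letters sc as the IPA fragment sk? when they are preceded by any character and followed by le. The function name, the convention of upper-case letters for spelling and lower-case for IPA, and the use of Python are assumptions made for this illustration; the actual LAS rule format is not reproduced here.

```python
import re

# Hypothetical encoding of the sample rule: "SC" preceded by any character
# and followed by "LE" may be pronounced [s] or [sk].
SC_RULE = re.compile(r"(?<=.)SC(?=LE)")
SC_IPA = "sk?"  # regular-expression fragment over IPA symbols: [s] or [sk]


def apply_rule(name: str) -> str:
    """Replace the matching letters of a name with the IPA fragment.

    Spelling is kept in upper case and IPA in lower case so that the two
    layers stay distinguishable in this toy example.
    """
    return SC_RULE.sub(SC_IPA, name.upper())


pattern = apply_rule("Muscle")
print(pattern)                                 # the partially converted name
print(bool(re.fullmatch(pattern, "MUsLE")))    # pronunciation with [s]
print(bool(re.fullmatch(pattern, "MUskLE")))   # pronunciation with [sk]
```

Because the output is itself a regular expression, one spelling can stand for several pronunciations, which is the third assumption listed for the rule sets.
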

As an example of the advantages of matching on IPA, consider a query on the name Lee.
Converted to the IPA string [li], exact matches with numerous spelling variants are
automatic, including Leigh and Li. Typical character-based matches will fail to retrieve
Leigh or Li, since the percentage of character overlap is minimal. Conversely, a standard
index matching system such as Soundex will categorize Lee and Li identically, but will still
miss Leigh, given the presence of a salient letter g, and will retrieve a large number of
names of low relevance, including Liu, Liao, Low, Louie, Luhoya and Lehew.
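
The Soundex behaviour described above can be checked with a short sketch. The Python function below implements the common American Soundex coding (first letter plus three digits); it is offered only as an illustration and is not the code of any particular index-matching product.

```python
# Letter groups of the common American Soundex scheme.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code of a name."""
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    code = name[0]
    prev = CODES.get(name[0], "")
    for ch in name[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


for name in ("Lee", "Li", "Leigh", "Liu", "Liao", "Low", "Louie", "Luhoya", "Lehew"):
    print(name, soundex(name))
```

Lee and Li share one code, as do the low-relevance names listed in the text, while the letter g moves Leigh into a different code.
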

6.0 Phonological Processes

In addition to predictable spelling variation, rules were written to account for predictable
articulatory processes (MacKay, 1987; Wolfram and Johnson, 1982). For example, the
variant spellings of Thomson ~ Thompson, Simson ~ Simpson, Demsey ~ Dempsey, etc.
can be accounted for by regular movement of the velum (i.e., the soft palate) from a
bilabial nasal [m] to an oral [s]. Production of an intrusive bilabial oral [p] is entirely a
result of the timing of the movement from nasal to oral articulation. LAS incorporated
likely articulatory variation into the IPA rule sets. Thus, a query of the name Thomson will
retrieve the variant Thompson as an exact match.

7.0 Testing the Rule Sets 

To test the net effect of the Orthography-to-IPA rules, LAS conducted a controlled test of
the rules by randomly selecting 157 test names from a database of 55,545. The database
contained names that were from sources identified as Arabic, Mandarin Chinese, Hispanic
and Anglo (again, broadly defined). A native speaker of educated standard American
English was asked to record the 157 test names using pronunciations of his choosing. The
audio recordings were played for native speakers of American English, who were asked to
write one or more "likely" spellings for each name. LAS elicited 3,689 variants in all.
The variant spellings were then used as test query names to calculate the retrieval rates
of the original name spellings. Overall, 69% of all variant spellings were retrieved by the
IPA rules. However, qualitative analysis of the results showed that approximately 23% of
the variant names not retrieved were due to perceptual mishearings of the recorded names.
For example, the variant spellings of the test name Baughn predictably included Bahn,
Bawn, and Bonn, and the IPA Conversion Rules succeeded in mapping all to the original
test name spelling. However, a fourth elicited spelling, Vaughn, was not predicted, and the
IPA Conversion Rules did not map it to Baughn. The mishearing of [v] for [b] is not
unusual, given the acoustics shared by the two sounds. The IPA Conversion Rules, which
include regular articulatory variants such as Thomson/Thompson, were purposely not
intended to retrieve perceptually similar names during the current phase of research.

8.0 Fuzzy Matches: Articulatory Similarity

At the heart of the research has been an effort to improve the automatic name searching
process by retrieving names that are similar to the query name. The IPA Conversion Rules
are able to capture a good deal of name variation that can be attributed to orthographic
sources, whether intralingual (e.g., Leigh/Lee) or interlingual (e.g., [illegible]
orthography from Chinese: Xie ~ Hsieh ~ Sye). An additional goal has been to retrieve
names that are not phonologically identical to the query name but that a careful analyst
would like to consider before abandoning a search. Thus, while spelling variants of the
name Benke include [illegible] and Benck, the analyst might want to consider names that
seem phonologically close to the query name without being a predictable variant (e.g.,
Benge, Banke, and perhaps even names like Penke, Panke or Bentsche). While most
search algorithms permit fuzzy matches, these are invariably based on calculations of
number of characters shared. From the perspective of character matching, the letter b is as
different from the p as it is from x, y or z. Thus, to permit retrieval of Penke for Benke is to
require retrieval of any name that differs from the query by the first character, including
Xenke, Yenke and Zenke. This clearly does not follow any phonologically reliable
principle, and significantly reduces the efficiency of automatic retrieval. Even indexed
systems, such as Soundex, group letters as either co-indexed or unrelated. Thus, while
Soundex is often called "phonetic" because it groups letters that share some phonological
characteristics, it cannot compare the degree to which two sounds, or indeed two names, are
related: it lacks granularity. Thus, Soundex would treat Benke, Penke and Panke as
identical (apart from its retained initial letter) rather than similar. Soundex would exclude
Bentsche from the group because of the letter t in the spelling, in effect treating Bentsche
as being equally distant from Benke as from Smith.

It is clear, however, that sound segments can be analyzed in terms of their articulatory
characteristics, and that some sounds fall into natural categories, such as vowels and
consonants. Properties of sounds have been described in detail by a number of linguistic
analyses according to place and manner of articulation (e.g., [p] and [b] are both articulated
at the lips by complete blockage of the air flow and sudden release of pressure). One of the
best known descriptions of phonetic classification is that of the American linguists
Chomsky and Halle (1968). All the distinct sounds of American English can be described
using 15 distinctive features (see Appendices B and C). By classifying sounds according
to these distinctive features, a fairly clear picture emerges of how close any two sounds are
to one another. Thus, [p] and [b] differ by just one feature, voicing, while [p] and [f] differ
by three and [p] and [v] by four. In general, articulatory distance can be counted in terms
of how many articulatory characteristics sounds share.
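
Counting feature differences can be sketched in a few lines of Python. The table below is a deliberately small, hypothetical stand-in: it covers only five consonants and three features, and its values are illustrative assumptions rather than the LAS feature file or the full 15-feature analysis.

```python
# Toy feature table (assumed values): 1 = feature present, 0 = absent.
FEATURE_NAMES = ("voiced", "coronal", "nasal")

SOUNDS = {
    "p": (0, 0, 0),
    "b": (1, 0, 0),
    "t": (0, 1, 0),
    "d": (1, 1, 0),
    "m": (1, 0, 1),
}


def differing_features(a: str, b: str) -> list:
    """Names of the features on which two sounds disagree."""
    return [name for name, x, y in zip(FEATURE_NAMES, SOUNDS[a], SOUNDS[b]) if x != y]


def feature_difference(a: str, b: str) -> int:
    """Articulatory distance as a simple count of differing features."""
    return len(differing_features(a, b))


print(differing_features("p", "b"))   # the two sounds differ in voicing only
print(feature_difference("p", "d"))   # voicing and place both differ
```

With a complete feature inventory the same count would yield the larger distances quoted in the text for more distant pairs.
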

LAS created a file of feature differences between pairs of sounds, essentially mapping
phonetic features onto IPA notation. By relaxing the threshold of allowable differences,
increasingly distant sounds are retrieved. Thus, by permitting matches of IPA characters
that are not exact matches, names are retrieved that are phonologically close. Even IPA
sound-to-sound comparisons yield interesting sets of names for comparison. By relaxing
retrievals to include single feature differences, a search of the name Smith now brings back
these additional names:

Search on SMITH
Feature Difference Threshold: 1

SMID
SMEAD
SNITH
SNIPE
SNIDE
SNEED
SNEAD
SNAPE
SNEATH

Figure 2 Fuzzy Search on Smith measuring Phonetic Feature Differences
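
A thresholded comparison of this kind might look like the following Python sketch. It assumes that the threshold applies to each pair of aligned sounds and that the two IPA strings have the same length; both are simplifying assumptions, and the feature sets are illustrative values rather than the contents of the LAS feature-difference file.

```python
# Hypothetical feature sets: each sound maps to the features it carries.
FEATURES = {
    "s": {"consonantal", "anterior", "coronal", "continuant", "strident"},
    "m": {"consonantal", "anterior", "nasal", "sonorant", "voiced"},
    "n": {"consonantal", "anterior", "coronal", "nasal", "sonorant", "voiced"},
    "ɪ": {"syllabic", "sonorant", "high", "voiced"},
    "θ": {"consonantal", "anterior", "coronal", "continuant"},
    "k": {"consonantal", "high", "back"},
}


def feature_distance(a: str, b: str) -> int:
    """Number of features carried by exactly one of the two sounds."""
    return len(FEATURES[a] ^ FEATURES[b])


def fuzzy_match(query: str, candidate: str, threshold: int) -> bool:
    """True if every aligned pair of sounds differs by at most `threshold` features."""
    if len(query) != len(candidate):
        return False
    return all(feature_distance(q, c) <= threshold for q, c in zip(query, candidate))


print(fuzzy_match("smɪθ", "snɪθ", 0))  # exact matching rejects the variant
print(fuzzy_match("smɪθ", "snɪθ", 1))  # one feature of slack accepts it
print(fuzzy_match("smɪθ", "smɪk", 1))  # a distant final sound is still rejected
```

Raising the threshold admits progressively more distant names, which is why a final sorting step becomes necessary.
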

Viewed in physiological terms, this is reasonable. Phonetic features refer to salient 
characteristics of articulation, so that differences generally reflect how likely it is that any 
two sounds would be articulated in place of another. There are numerous additional 
factors, of course, that ought to be considered in measuring how similar two names are to 
one another articulatorily.

9.0 Final Sorting of Names Retrieved 

The names retrieved by searches on phonetic features may not all be of equal relevance to 
the query name. Additional factors are under consideration to sort names retrieved, based
on a variety of phonological characteristics. 

9.1 Sonority Level 

The differences in phonetic features generally express the amount of effort needed to move
articulators from one sound to another. The sounds [p], [t] and [k] form a natural class of
voiceless stop consonants, identical in manner of articulation. All are extremely
common in the world's languages, and are among the first acquired by children. They
differ in place of articulation, and this is reflected in feature differences. However, manner
of articulation is probably a better measure of energy expenditure than is place of
articulation: voiceless stops are all extremely low in sonority, that is, the amount of energy
needed to produce a sound. Vowels, on the other hand, require much more effort: they, in
essence, carry the sound wave. In order for feature differences to effectively measure level
of effort required, differences should be weighted according to sonority level. In general
terms, sounds fall into nine levels of sonority, with voiceless stops [p], [t] and [k] at the
low end and the vowels [ɑ] as in father and [æ] as in fan at the most sonorous end
(Ladefoged, 1982). Sorts of names retrieved ought to consider the sonority value of
sounds. This might be accomplished by weighting phonetic features or by a more
complicated comparison of sonority level contours of names or syllables.

9.2 Syllabification 

Additionally, in languages that time segments based in part on stress patterns, it is
reasonable to compare stressed syllables to one another. In the following example, names
have been aligned in terms of substrings, in this case corresponding to syllables:

Chester:     [     tʃɛ stɚ     ]
Chesterton:  [     tʃɛ stɚ tən ]
Winchester:  [ wɪn tʃɛ stɚ     ]

Both in terms of articulatory effort (sonority) and psychological salience, it would be
misleading to treat all three occurrences of the substring [tʃɛ] as equivalent: stress clearly
must be included in the equation. LAS has written a syllabifier that automatically parses
English IPA strings, including names, according to a set of rules. Future research will
investigate the possibility of ranking similar names through analysis at the syllabic level.
Syllabic level analysis has the strength of lining up comparable substructures of names.
All syllables share the same internal structures (i.e., onset of the syllable, nucleus, and
coda), and alignment by syllable enables meaningful comparisons of internal structures of
names (where a period represents the syllable break):

Linda  [ l ɪ n . d ə ]
Lisa   [ l i _ . s ə ]

Note that in the above example, the coda (i.e., end) of the first syllable in Linda is filled by
[n] but empty in Lisa, as indicated by the underscore. A meaningful comparison of the two
names would compare the [n] of Linda to an empty coda rather than to the [s] in the onset
(i.e., beginning) of the second syllable of Lisa.
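
The slot-by-slot comparison can be sketched in Python as follows. The representation of a syllable as an (onset, nucleus, coda) tuple, the transcriptions of the two names and the function name are assumptions made for this illustration; the LAS syllabifier itself is not shown.

```python
from itertools import zip_longest

SLOTS = ("onset", "nucleus", "coda")
EMPTY = ("", "", "")  # filler for a name with fewer syllables

# Hand-syllabified names; an empty string marks an unfilled slot.
LINDA = [("l", "ɪ", "n"), ("d", "ə", "")]
LISA = [("l", "i", ""), ("s", "ə", "")]


def align(name_a, name_b):
    """Pair up segments syllable by syllable and slot by slot."""
    pairs = []
    for syl_a, syl_b in zip_longest(name_a, name_b, fillvalue=EMPTY):
        for slot, seg_a, seg_b in zip(SLOTS, syl_a, syl_b):
            pairs.append((slot, seg_a, seg_b))
    return pairs


for slot, seg_a, seg_b in align(LINDA, LISA):
    print(f"{slot:8} {seg_a or '_':3} {seg_b or '_'}")
```

In this alignment the [n] of Linda is paired with the empty coda of Lisa, and the [d] of Linda with the [s] of Lisa, rather than [n] with [s] as a left-to-right character comparison would do.
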

9.3 Position in Name 

Some weight ought to be given to absolute initial position in names. Many indexed
systems, including Soundex, key names to the initial letter. This is, of course, problematic,
since the initial letter may be silent or part of a digraph (e.g., Knox, Philip). However,
indexing on the first sound, or at least considering the first sound as more significant than
sounds in other positions, may be warranted. This, like syllable-level comparisons, will
probably be a factor in final sorting of names retrieved.

9.4 Non-Phonological Factors in Sorting of Names Retrieved 

Certainly, it must be acknowledged that non-phonological levels of analysis may be critical
to any useful definition of similarity. Morphological units - word parts that may contain
semantic information, including prefixes and suffixes - such as Mc-, -ton, and -sky are
likely sources of variations. Thus, Lubin and Lubinsky are critically related (in terms of
their roots), while Lubin, Rubin and Lupine are very close in terms of articulation. The
morphological factor could be handled efficiently with a look-up list of morphological
elements, but this remains outside the current scope of this project.

Similarly, orthography itself might play a useful role in the final sort of names retrieved.
The following names retrieved for a fuzzy search on the name Bucket have been sorted 
using a simple sort on letters. 



Search on BUCKET
Feature Difference Threshold: 1

BYXKETT
BEXKET
BIXKET
B Ye YET
BEXKETT
BIXKETT
BYXHHEIT
BOQYET
BAG YET
BOX HAT
BYXHITE
BEXKOITH
BAXOT
BOOKOYT
BAXOTB
BOY0YET
BEKHIT
BOQXYTT
BE&YETTE

Figure 3 Search on Bucket Sorted by Spelling

Current plans are for a final ranking of names retrieved based on a combination of factors,
including number of syllables, stress, weighting of features by sonority levels and
name-initial segments.

10.0 Conclusions 

In sum, automatic name searching can benefit in three ways from incorporation of
phonological information:

• leveling differences due exclusively to orthographic mapping;

• leveling differences due to predictable phonological processes, such as intrusive
consonants; and

• retrieving additional names that contain phonologically similar sounds to those of the
query name.

Having retrieved phonologically relevant names, a phonologically-enhanced name search
engine can then sort names using a multiple factor weighting scheme.

LAS views this technology as extremely promising, offering a tool to enhance current 
automatic name searching, increasing chances of retrieving name variants that character- 
based systems miss by retrieving and sorting names in a phonologically principled way. 



© Language Analysis Systems, Inc., 1997






Appendix A: Descriptions of IPA Symbols

Phonetic    Description                                   Example
symbol

p           voiceless bilabial stop                       p in the English name Peter
b           voiced bilabial stop                          b in the English name Buddy
ɸ           voiceless bilabial fricative                  f in the Japanese name Fujimori
β           voiced bilabial fricative                     b in the Spanish word saber
m           bilabial nasal                                m in the English name Mary
ɥ           voiced rounded palatal approximant            u in the French name Nuit
f           voiceless labio-dental fricative              f in the English name Fred
v           voiced labio-dental fricative                 v in the English name Vera
ɱ           voiced labio-dental nasal                     n in the Italian word anfora
t           voiceless alveolar stop                       t in the English name Ted
d           voiced alveolar stop                          d in the English name Doug
θ           voiceless apico-dental fricative              th in the English name Theodore
ð           voiced apico-dental fricative                 th in the English name Rather
s           voiceless alveolar fricative                  s in the English name Sam
z           voiced alveolar fricative                     z in the English name Zachary
n           voiced alveolar nasal                         n in the English name Nathan
l           voiced alveolar lateral                       l in the English name Linda
ɬ           voiceless alveolar lateral fricative          ll in the Welsh name Llewellyn
ɮ           voiced alveolar lateral fricative             dhl in the Zulu word dhla (to eat)
ɹ           voiced alveolar continuant                    r in the English name Richard
r           voiced apico-alveolar trill                   r in the Spanish name Ricardo
ɾ           voiced alveolar flap                          tt in the English name Ritter
ʈ           voiceless retroflex stop                      as in the Arabic name Tariq
ɖ           voiced retroflex stop                         as in the Arabic word difda' (frog)
ʂ           voiceless retroflex fricative                 as in the Arabic name Sabir
ʐ           voiced retroflex fricative                    as in the Arabic name Dhafir
ɳ           voiced retroflex nasal                        Marathi (India)
ɭ           voiced retroflex lateral approximant          Marathi (India)
ɽ           voiced retroflex flap                         d as in Hindi dal (lentil stew)
ʃ           voiceless palato-alveolar fricative           sh in the English name Sheila
ʒ           voiced palato-alveolar fricative              z in the English word azure
ɕ           voiceless alveo-palatal fricative             x as in the Chinese name Xia
ʑ           voiced alveo-palatal fricative                ź in the Polish word źle
tʃ          voiceless palato-alveolar affricate           ch in the English name Charlie
dʒ          voiced palato-alveolar affricate              j in the English name Jennifer
ɲ           voiced palatal nasal                          ñ in the Spanish word Doña
ʎ           voiced palatal lateral approximant            ll in the Spanish word calle (street)
k           voiceless velar stop                          k in the English name Kim
g           voiced velar stop                             g in the English name Gary
x           voiceless velar fricative                     j in the Spanish name Jose
ɣ           voiced velar fricative                        g in the Spanish word luego (later)
ŋ           voiced velar nasal                            ng in the English name Bing

Appendix A: Descriptions of IPA Symbols (Continued)

Phonetic    Description                                   Example
symbol

[illegible] voiceless velar lateral                       l in the Polish Walesa
ʍ           voiceless labio-velar approximant             wh as in the English name White (for
                                                          some speakers)
w           voiced bilabial approximant                   w in the English name Wayne
q           voiceless uvular stop                         as in the Arabic name Qasim
ɢ           voiced uvular stop                            Eskimo and Tehrani Persian
χ           voiceless uvular fricative                    ch as in the German word Buch
ʁ           voiced uvular fricative                       r in some Parisian pronunciations of the
                                                          French name Renée
ɴ           voiced uvular nasal                           n in the Eskimo word eNima (melody)
ʀ           voiced uvular trill                           r in the French name Renée
ħ           voiceless pharyngeal fricative                h as in the Arabic name Muhammad
ʕ           voiced pharyngeal fricative                   as in the Arabic name Sa'ad
ʔ           voiceless glottal stop                        tt as in the English name Sutton or the
                                                          word mitten
h           voiceless glottal fricative                   h in the English name Henry
ɦ           voiced glottal fricative                      h as in English between voiced sounds,
                                                          as in the word manhood
y           high front rounded vowel                      u in the French word lune (moon)
ɨ           high central unrounded vowel                  as in the Russian word syn (son)
ʉ           high central rounded vowel                    u as in the Norwegian hus
ɯ           high back unrounded vowel                     u as in the Japanese name Kazu
u           high back rounded vowel                       ou as in the French word tout
ø           upper mid-front rounded                       ö as in the German name Schönfeld
ɤ           upper mid-back unrounded vowel                as in the Shan (Burma) word [illegible]ko (salt)
o           upper mid-back rounded                        o as in the English name Mona
ɪ           semi-high front unrounded vowel               y as in the English name Lynn
ɛ           lower mid-front unrounded                     e as in the English name Deborah
œ           lower-mid front rounded vowel                 oeu as in the French word oeuf (egg)
ʌ           lower-mid back unrounded vowel                u as in the English name Tupperman
ɔ           lower-mid back rounded                        o as in the English name Lord
æ           open front unrounded vowel                    a as in the English name Hal
ɐ           open central unrounded vowel                  a as in the Portuguese word para (for)
a           low front unrounded vowel                     a as in the French word patte (paw)
ɑ           low central unrounded vowel                   a as in the French name Delâtre or the
                                                          word pâte (paste or dough)
ɒ           low back rounded vowel                        o as in the British English word hot
ə           mid central unrounded vowel                   e & a as in the English name Belinda
ʊ           semi-high back rounded vowel                  u as in the English name Butch
e           upper-mid front unrounded                     a as in the English name Mable
i           high front unrounded vowel                    first e in the English name Pete
ɚ           rhotacized mid-vowel                          er as in the English name Heather

Appendix A: Descriptions of IPA Symbols (Continued)

Phonetic    Description                                   Example
symbol

tɕ          voiceless alveo-palatal affricate             j as in the Chinese name Jin
tɕ'         voiceless aspirated alveo-palatal affricate   q as in the Chinese name Qiu
ts          voiceless unaspirated dental affricate        ts as in the Chinese name Tsang
ts'         voiceless aspirated dental affricate          c as in the Chinese name Cao
ʘ           bilabial click                                as in Southern Bushman languages
ǀ           dental (alveolar) click                       as in Bushman
[illegible] palatal click                                 as in Bushman
[illegible] palato-alveolar click                         as in Hottentot
ǁ           alveolar lateral click                        as in Bushman, Zulu

Appendix B: Description of Phonetic Features

A. Major class features:

1. Syllabic

Forms the central peak of a syllable. Vowels are usually +syllabic, consonants are
usually -syllabic, but some (like [l]) may be syllabic (as in "riddle").

2. Sonorant

Minimal constriction in the mouth. Vowels, as well as [n], [m], [r], [l], [w] are all
+sonorant. Most other consonants are -sonorant.

3. Consonantal

Obstruction along a central point in the mouth. All English sounds except vowels and
glides ([w] and [y]) are +consonantal.

B. Manner of Articulation Features:

4. Continuant

Continued air movement through the mouth during sound production. This feature
contrasts fricative sounds like [f] and [v] with non-continuants like [p] and [b].

5. Strident

Narrow obstruction through which air escapes, producing hissing or "white noise". [s],
[z], [f], [v] and the sounds in church and judge are +strident. This is the most
acoustically-based feature in this list.

6. Delayed Release

Gradual release of air. In English, it is used to distinguish the sounds in church and
judge from [t] and [d].

7. Nasal

Soft palate at the back of the mouth is lowered and air goes into nose. In English, [n],
[m] and [ŋ] (the final sound in king) are +nasal.

8. Lateral

Side(s) of tongue lowered so that air escapes along side, as in English [l].

C. Place of articulation:

9. Anterior

Obstruction of mouth anywhere from gum ridge forward to lips. English [p], [b],
[m], [f], [v], and [ð] (as in the) are all +anterior.

10. Coronal

Front of the tongue raised. The sounds [t] and [d] are +coronal. Sounds like [k] and
[g] are -coronal.

11. High

Body of tongue raised. [j] (as in yellow), and the vowel [i] (as in feel) are +high.

Appendix B: Description of Phonetic Features (Continued)

12. Low

Body of tongue lowered. The vowels [æ] as in back and [ɑ] as in father are +low.

13. Back

Body of tongue moved back. The sounds [k] and [g] and the vowel [u] as in hoot are
+back.

14. Tense

Root of tongue muscle tensed. The vowel [i] (as in feet) is +tense. The vowel [ɪ] as
in fit is -tense.

15. Round

Lips pursed or rounded. English vowel [u] (as in boot) is +round, while [i] (as in
beet) is -round.

Appendix C: Phonetic Features for [p], [b] and [f]

Phonetic Features: syllabic, consonantal, coronal, high, low, back, continuant, strident,
delayed release, voiced, nasal, lateral, round

[Table of feature values for [p], [b] and [f]; the values are not legible in the source.]

1.0 Introduction

This narrative describes the algorithms and techniques used by the Name Search - Technology
Demonstration System (NS-TDS). It is the English-language version of the C++ source code
that was used to develop the system. There are three major sections, covering NS-TDS support
files, building the data base and performing a query. Each section relies on the contents of the
previous section, so to effectively understand the system, this document should be read from
beginning to end.

This narrative is tied to the source code through the use of paragraph numbers and comment
lines. Thus, whenever a block of code implements a technique or algorithm described in this
document, a comment line has been inserted referencing the paragraph number. Comment lines
are in the format "// narrative paragraph number, x.x", where "x.x" stands for the paragraph
number. If a block of code refers to more than one narrative paragraph, additional comments are
added as separate lines.

2.0 Support Files

NS-TDS is a data-driven application dependent on a number of files that encapsulate years of
computational linguistic research. These files represent the heart of the system and are essential
to understanding how the primary algorithms work. This section of the narrative introduces
these files by describing their purpose, contents and use.

2.1 Name Classifier Tables 

NS-TDS classifies the culture of a name as either Arabic, Chinese, Hispanic or "Other" (the
default) by statistically analyzing its spelling. This analysis is accomplished with the aid of the
following culture-specific statistical distribution tables:

2.1.1 Digraph Score 

Digraphs are contiguous letter pairs formed by parsing a name bracketed by a beginning and an 
ending boundary. For example, the name "FRED" consists of five digraphs: "#F", "FR", "RE", 
"ED" and "D#", where the symbol "#" represents a name boundary. In this table, digraphs that 
are clear indicators or contra-indicators of a particular culture are stored with a relative score. 
NS-TDS uses culture-specific tables to show the statistical likelihood of a particular digraph 
occurring in the applicable culture. For example, the digraph "QA" occurs almost exclusively in 
Arabic names, whereas the digraph "FM" almost never occurs in Arabic names. In the Arabic
digraph table, "QA" is associated with a high positive score, and "FM" is associated with a low 
negative score. 
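
The digraph parsing described above is simple to sketch. In the Python fragment below, the function names and the score values are hypothetical and chosen only for illustration; the contents of the actual NS-TDS score tables are not reproduced in this narrative.

```python
def digraphs(name: str, boundary: str = "#") -> list:
    """Split a name, bracketed by boundary symbols, into contiguous letter pairs."""
    padded = f"{boundary}{name.upper()}{boundary}"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


# Hypothetical excerpt of an Arabic digraph table: a strong indicator
# and a strong contra-indicator.
ARABIC_DIGRAPH_SCORES = {"QA": 90, "FM": -90}


def scored_digraphs(name: str, table: dict) -> dict:
    """Look up the digraphs of a name that appear in a culture's score table."""
    return {dg: table[dg] for dg in digraphs(name) if dg in table}


print(digraphs("FRED"))
print(scored_digraphs("QASIM", ARABIC_DIGRAPH_SCORES))
```

How the individual scores are combined into a classification is not shown here, since that step is not described in this paragraph.
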

2.1.2 Trigraph Score 

Trigraphs are contiguous letter triplets that, for the purposes of TDS, are limited to the beginning 
and ending trigraphs. For example, the name "FRED" consists of the trigraphs "#FR" and 
"ED#". As with digraphs, NS-TDS uses culture-specific tables to show the statistical likelihood 
of a particular trigraph occurring in the applicable culture. 
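
The beginning and ending trigraphs can be extracted as in the following sketch. The function name is an assumption made for illustration only.

```python
def boundary_trigraphs(name, boundary="#"):
    """Return the (beginning, ending) trigraphs of a boundary-bracketed name."""
    padded = boundary + name + boundary
    return padded[:3], padded[-3:]


print(boundary_trigraphs("FRED"))
```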

2.1.3 Name Stop List 

While generally good indicators of culture, digraph and trigraph distributions can erroneously 
classify specific names. For example, the name "BARKER" is identified as Arabic because it 
contains common Arabic letter patterns. The Name Stop List tables were implemented as a 
stopgap fix to this problem. For each culture, there is a Name Stop List table that contains a 
name along with a score that is either very positive (set to 2000) or zero (0). A high score means 
that name belongs to that culture; a score of zero means that the name does not belong there. So, 
"BARKER" is in the Arab Name Stop List with a score of 0. 

The information in these tables is repeated for each culture and name part (i.e., given name or 
surname). For example, the following tables exist for Arabic: 



agdi.dbf       Arabic digraph scores 
agtri.dbf      Arabic trigraph scores 
asdi.dbf       Arabic digraph scores 
astri.dbf      Arabic trigraph scores 
agnames.dbf    Arabic given name stop list 
asnames.dbf    Arabic surname stop list 



There are similarly named tables for Chinese starting with the letter "c" and tables for Hispanic 
starting with the letter "h". 

2.1.4 Phonetic Rules 

In order to convert the spelling of a name into a phonetic representation, NS-TDS consults 
several rule files. They contain records that consist of search parameters based on spelling and 
replacement regular expressions based on International Phonetic Alphabet (IPA) characters. 
Take the following rule, for example: 

Boundary, "KN", Vowel, "(kn|kan|n)" 

It says that if the letter string "KN" is found at the beginning of a name ("Boundary") and is 
followed by a vowel, replace it with the IPA string, "(kn|kan|n)", where "|" indicates "or". The 
replacement string indicates that there are three possible pronunciations: [kn] or [kan] or [n]. 
(The use of square brackets is standard phonetic notation to indicate sounds rather than spelling). 
The spelling of a name is run through the rules until all characters are replaced with regular 
expressions. The name "KNOX" thus results in the regular expression (kn|kan|n)(a)(ks). 

NS-TDS uses eight rule sets. For each of the four cultures (Anglo, Arabic, Chinese and 
Hispanic), there is a single vowel rule set and a multiple vowel rule set. The one vowel versions 
level all vowels to an [a] and produce fewer variations. They are used for retrieval. The multiple 
vowel versions contain three basic vowel sounds, [a], [i] and [u], and are used in the ranking of 
retrieved names, since they are more precise than single-vowel rankings. 

2.1.5 Simplified Phonetic Rules 

One additional phonetic rule file is maintained to aid in the filtering process. It is a cross- 
reference file between all of the possible replacement strings in the single vowel rule sets and a 
simplified version of the replacement string. It is "simplified" in the sense that all unbalanced 
"ors" become balanced. For example, the replacement string (kn|kan|n) is "unbalanced" in that 
the possible pronunciations can contain one, two or three sounds. The simplified version is, 
(k?)(a?)(n), where "?" means that the sound is optional. The simplified string allows TDS to 
compare two regular expressions that may generate thousands of possible pronunciations with 
one calculation, thereby improving performance dramatically. Note that the simplified strings 
sometimes generate more possible pronunciations than the original replacement string, but never 
fewer. The additional pronunciations are handled adequately by the Ranker (see Section 4.5.1). 
Currently, this file is named tds.simpj-ul. 
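
The following sketch illustrates why a simplified string can generate more pronunciations than the original replacement string, but never fewer. It expands both forms by brute force, which is only an illustration; the parsing helpers are assumptions and do not reflect how NS-TDS stores or compares these strings.

```python
import itertools
import re


def expand_alternatives(expression):
    """Expand a string such as "(kn|kan|n)" into its set of pronunciations."""
    groups = re.findall(r"\(([^()]*)\)", expression)
    choices = [group.split("|") for group in groups]
    return {"".join(parts) for parts in itertools.product(*choices)}


def expand_simplified(expression):
    """Expand a string such as "(k?)(a?)(n)", where "?" marks an optional sound."""
    groups = re.findall(r"\(([^()?]+)(\??)\)", expression)
    choices = [[sound, ""] if optional else [sound] for sound, optional in groups]
    return {"".join(parts) for parts in itertools.product(*choices)}


original = expand_alternatives("(kn|kan|n)")
simplified = expand_simplified("(k?)(a?)(n)")
print(sorted(simplified - original))   # the extra pronunciation(s)
```

In this example the simplified form covers all three original pronunciations and adds one more, [an].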

2.1.6 Leveled IPA Matrix 

Generating retrieval keys requires the creation of leveled IPA variant strings. That is, similar 
sounds (e.g., [s] and [z]) are treated as a single set. NS-TDS uses a cross reference file to define 
the set relationships. It is currently called grouparray.dat. 

2.1.7 Feature Difference Matrix 

One of the key components of TDS is the ability to calculate a phonetic score when comparing 
two names. When comparing individual sounds, the calculation weights the difference between 
two sounds based on a feature distance matrix. This matrix consists of all combinations of two 
IPA characters and a score between 0.0 and 1.0 representing their phonetic proximity to one 
another, as defined by articulatory measures of similarity. It also contains records that represent 
the insertion or deletion of an extra sound. For example, the score assigned to the replacement of 
a [t] with a [d] is lower than the score assigned to the replacement of a [t] with a [k]. Further, 
inserting a vowel is given a lower score than inserting a consonant such as [t] or [k]. 

The scores contained in this matrix reflect penalties. That is, higher scores mean that the sounds 
are further apart. All of the scores are based on linguistic principles of articulation, and reflect 
the number and type of phonetic features that cause the sounds to be different. 

3.0 Building the NS-TDS Data Base 



In order to search a large number of names quickly, NS-TDS uses a data base of name 
information and indices. This data base is built by a program hereafter referred to as the Data 
Loader. This program takes as input a text file of names that are preceded with a group ID. The 
group ID is a minor component of TDS that was implemented to facilitate an independent 
evaluation by ORD. 

Building the data base consists of two major steps. First, the names are pre-processed to 
generate the information needed by the retrieval, filtering and ranking algorithms. This pre- 
processed data is stored in temporary tables that are subsequently turned into the NS-TDS data 
base and indices. The following paragraphs describe this process in detail. Where appropriate, 
examples are used to make the description easier to understand. 

3.1 Pre-Process Names 

All names are pre-processed to ensure validity (see 3.1.1) and to gather the information necessary 
for retrieval, filtering and ranking. This process shares many of the components used to pre- 
process a query name during an NS-TDS search. 

3.1.1 Edit the Name 

Input names are provided to the Data Loader in a text file and are edited according to the 
following specifications: Positions 1 through 6 must contain a group ID, where the first 
character must be a digit or the letter "Z". All other characters must be digits. Position 7 
must be blank. Positions 8 through 37 contain the name and can only consist of upper case letters 
or an apostrophe. Furthermore, the name must be at least 2 characters in length and no longer 
than 30 characters. Any records that fail to follow the prescribed format are rejected and written 
to an error log, along with an appropriate message. 
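
A minimal sketch of this record edit is shown below. It assumes that the name field may be padded on the right with blanks, which the narrative does not state; the function name and error text are also illustrative.

```python
import re

# Positions 1-6: group ID (digit or "Z", then five digits); position 7: blank;
# positions 8-37: a name of 2 to 30 upper case letters or apostrophes.
RECORD_FORMAT = re.compile(r"([0-9Z][0-9]{5}) ([A-Z']{2,30})")


def edit_record(line):
    """Return (group_id, name) for a valid record, else raise ValueError."""
    # Assumption: trailing blanks pad the fixed-width name field.
    record = line.rstrip("\n").rstrip(" ")
    match = RECORD_FORMAT.fullmatch(record)
    if match is None:
        raise ValueError("record does not follow the prescribed format: %r" % line)
    return match.group(1), match.group(2)


print(edit_record("000123 KNOX"))
```

A caller would catch the `ValueError` and write the rejected record, with its message, to the error log.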

3.1.2 Classify the Name 

The spelling of the name is statistically analyzed to determine the probable culture (Arabic, 
Chinese, Hispanic, or "Other"). This analysis is accomplished with the aid of the name classifier 
tables. 

First, the name is parsed into digraphs (contiguous letter pairs) and beginning and ending 
trigraphs (contiguous letter triplets) (see 2.1.1 and 2.1.2). Next, the digraphs and trigraphs 
are located in the appropriate classifier table to obtain the individual score. All of the scores are 
summed to obtain a total score. This process is repeated for all cultures. 

Then the Name Stop List tables for each culture are checked. If the name is found in one of the 
tables, the associated score is returned ("2000" means in the culture, "0" means not in the 
culture). If the name is found, the previously calculated culture score is replaced. 

Finally, each score is compared to a culture-specific threshold. If no scores exceed the culture 
threshold, the name is classified as "Other". If one score exceeds the appropriate threshold, the 
name is classified accordingly. If more than one score exceeds the culture threshold, the highest 
score is chosen and that culture is returned. If there is a tie (very unlikely), the culture is chosen 
alphabetically with Arabic first followed by Chinese and then Hispanic. It is important to note 
that an input name will receive only one classification. 
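
The decision logic of this step can be sketched as follows. The summed digraph/trigraph scores are taken as input, and every score, threshold and table entry in the example is hypothetical; only the stop-list values 2000 and 0 come from the narrative.

```python
# Alphabetical order doubles as the tie-break order described above.
CULTURES = ["Arabic", "Chinese", "Hispanic"]


def classify(name, summed_scores, stop_lists, thresholds):
    """Return the single culture assigned to `name`, or "Other"."""
    best_culture, best_score = "Other", None
    for culture in CULTURES:
        score = summed_scores[culture]
        stop_list = stop_lists.get(culture, {})
        if name in stop_list:
            score = stop_list[name]      # stop-list score replaces the computed one
        if score > thresholds[culture]:
            # Strict ">" keeps the alphabetically earlier culture on a tie.
            if best_score is None or score > best_score:
                best_culture, best_score = culture, score
    return best_culture


# Hypothetical numbers for illustration only.
thresholds = {"Arabic": 500, "Chinese": 500, "Hispanic": 500}
stop_lists = {"Arabic": {"BARKER": 0}}
scores = {"Arabic": 900, "Chinese": 100, "Hispanic": 200}
print(classify("BARKER", scores, stop_lists, thresholds))
```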

3.1.3 Generate 1 Vowel Regular Expressions 

In this step, the spelling of the name is run through the spelling-to-IPA phonetic conversion 
rules, to generate a regular expression that represents all of the possible pronunciations of the 
name. Every name is run through the single-vowel Anglo phonetic rule set, which is the 
default/generic rule set. If the name was classified as Arabic, Chinese or Hispanic, it is also run 
through the appropriate single-vowel rule set for that culture, generating a second IPA regular 
expression. 

3.1.4 Generate Simplified Regular Expressions 

Using the simplified phonetic rules, a simplified regular expression is generated. The 
expression is encoded into compact byte representations to make further calculations faster. As 
before, if the name was classified as Arabic, Chinese or Hispanic, a second simplified regular 
expression is generated according to the appropriate rule set. 

3.1.5 Generate 1 Vowel Variants 

Using the generated regular expression, a list of possible IPA variants is generated and added to 
a temporary table of variants for all input names. As an example, the name "KNOX", which 
generates the regular expression (kn|kan|n)(a)(ks), generates the following variants: [knaks], 
[kanaks], and [naks]. The temporary table lists all variants, as well as the name that generated 
the variant. It is used later in the data base build process. 
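
Variant generation from a regular expression of this shape can be sketched as a Cartesian product over the alternatives in each group. The parser below handles only the simple parenthesized form quoted in this narrative and is an illustrative assumption, not the NS-TDS implementation.

```python
import itertools
import re


def expand(expression):
    """List every variant of an expression such as "(kn|kan|n)(a)(ks)"."""
    groups = re.findall(r"\(([^()]*)\)", expression)
    choices = [group.split("|") for group in groups]
    return ["".join(parts) for parts in itertools.product(*choices)]


print(expand("(kn|kan|n)(a)(ks)"))
```

The same function applied to the multiple-vowel expression (kn|kan|n)(a|u)(ks) yields six variants.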

3.1.6 Determine the Initial Consonants 

The variants are then analyzed, to generate a list of all possible name-initial IPA consonants. For 
example, the name "KNOX" starts with the regular expression (kn|kan|n), which can have an 
initial consonant of [k] or [n]. Note that if the variant starts with an IPA vowel, the first IPA 
consonant is used to build this list. Thus the name O'NEIL would have [n] as the initial 
consonant. This information is used by the Ranker (see section 4.4.4 below). 

3.1.7 Set the Initial Vowel Switch 

Next, the variants are analyzed to determine if it is possible for the pronunciation of the name to 
start with a vowel. It is a three-way switch that indicates whether the pronunciation (1) can 
never start with a vowel, (2) can sometimes start with a vowel or (3) always starts with a vowel. 
This information is used by the Ranker (see section 4.4.5 below). 
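
The analyses in 3.1.6 and 3.1.7 can be sketched together, since both inspect the start of each variant. The vowel set, the switch constants and the sample variant for O'NEIL are assumptions made for illustration.

```python
IPA_VOWELS = set("aiu")                 # assumption: the three basic vowel sounds
NEVER, SOMETIMES, ALWAYS = 1, 2, 3      # hypothetical encoding of the switch


def initial_consonants(variants):
    """Collect the first consonant of every variant, skipping leading vowels."""
    found = set()
    for variant in variants:
        for sound in variant:
            if sound not in IPA_VOWELS:
                found.add(sound)
                break
    return found


def initial_vowel_switch(variants):
    """Three-way switch; assumes at least one variant is supplied."""
    starts_with_vowel = [variant[:1] in IPA_VOWELS for variant in variants]
    if all(starts_with_vowel):
        return ALWAYS
    if any(starts_with_vowel):
        return SOMETIMES
    return NEVER


knox = ["knaks", "kanaks", "naks"]
print(sorted(initial_consonants(knox)), initial_vowel_switch(knox))
```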

3.2 Build Data Base and Indices 

This step takes all of the information produced during pre-processing and builds the data base 
and indices used by TDS for retrieval, filtering and ranking. 

3.2.1 Create Name Files 

A name file is generated for all four cultures processed by TDS (Anglo, Arabic, Chinese and 
Hispanic). "Anglo" represents the default, and therefore all names generate an Anglo record; 
only those names that are appropriately classified generate records in the other culture name 
files. Each record in the name file contains: the spelling of the name, the simplified regular 
expression codes, the list of initial consonants, the initial vowel switch, the group ID and an 
internal unique ID. 

The naming convention is a four-letter culture identification followed by the extension, "nam". 
Currently, the following name files are generated: angl.nam, arab.nam, chin.nam and hisp.nam. 

3.2.2 Generate Leveled Variants 

Using the variants generated during pre-processing and the leveled IPA matrix, a list of leveled 
variants is built. Furthermore, the input variants have duplicate contiguous characters removed. 
The IPA characters in "KNOX" generate the following numeric codes, based on sets of similar 
sounds: [k] = 5; [n] = 2; [a] = 0; [s] = 4; [z] = 4. (Note that [s] and [z] are both indexed as "4", 
since they are similar sounds). The following unique leveled variants are generated: 52054, 
502054 and 2054. Note that the number of leveled variants is usually less than the number of 
non-leveled variants. For each input name, the leveled variants are added to a temporary file that 
lists all leveled variants and the name that generated it. 
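
The leveling step can be sketched as follows, using only the numeric codes quoted above. The full leveled IPA matrix is not reproduced in this narrative, so the mapping here is deliberately partial and the function names are illustrative.

```python
import itertools

# Only the codes quoted in the text; the real matrix covers all IPA sounds.
LEVEL_CODES = {"k": "5", "n": "2", "a": "0", "s": "4", "z": "4"}


def level(variant):
    """Remove duplicate contiguous characters, then map each sound to its set code."""
    collapsed = "".join(sound for sound, _ in itertools.groupby(variant))
    return "".join(LEVEL_CODES[sound] for sound in collapsed)


def leveled_variants(variants):
    """Return the unique leveled variants in sorted order."""
    return sorted({level(variant) for variant in variants})


print(leveled_variants(["knaks", "kanaks", "naks"]))
```

Because [s] and [z] share a code, a variant such as [nakz] levels to the same key as [naks], which is why the leveled list is usually shorter than the non-leveled one.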

3.2.3 Create Retrieval Indices 

Retrieval indices consist of a unique sorted list of leveled variants. As with the name files, one 
index is generated for each culture. Each index is created by sorting and then deduping, i.e., 
removing duplicate forms from the previously-built temporary file of leveled variants. 

The files produced by this step are named angl.idx, arab.idx, chin.idx and hisp.idx. 

3.2.4 Create Index-to-Name Maps 

Finally, a map file is created that cross-references all of the index records with the name records 
that generated the leveled variant. The retrieval index records contain a pointer to a map record. 
The map record contains a list of pointers to name records. This structure allows TDS to quickly 
scan the indices and generate a list of candidate names during retrieval. 
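
The relationship between the retrieval index, the map records and the name records can be sketched with in-memory lists. This is an assumption-laden illustration: the on-disk layout of the index and map files is not described here, and list positions stand in for the pointers.

```python
import bisect
from collections import defaultdict


def build_index(leveled_rows):
    """Build (index, vectors) from (leveled_variant, name_id) pairs.

    index[i] is a unique leveled variant; vectors[i] is its map record,
    i.e. the list of name IDs that generated that variant.
    """
    name_ids = defaultdict(set)
    for key, name_id in leveled_rows:
        name_ids[key].add(name_id)
    index = sorted(name_ids)                          # sort and dedupe
    vectors = [sorted(name_ids[key]) for key in index]
    return index, vectors


def lookup(index, vectors, key):
    """Binary-search the index and return the name IDs for `key`."""
    position = bisect.bisect_left(index, key)
    if position < len(index) and index[position] == key:
        return vectors[position]
    return []


rows = [("2054", 1), ("52054", 1), ("502054", 1), ("2054", 2)]
index, vectors = build_index(rows)
print(index, lookup(index, vectors, "2054"))
```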

The names of the map files are angl.vec, arab.vec, chin.vec, and hisp.vec. 

4.0 Performing a Query 



The heart of NS-TDS is the ability to perform a query that returns a ranked list of results. This is 
done using five major steps: query pre-processing, exact phonetic search, similar phonetic 
search, initial ranking and final ranking. The following paragraphs describe these steps and their 
components in some detail. When appropriate, examples are used to make the descriptions 
easier to understand. 

4.1 Pre-Process Names 

The query name is pre-processed to ensure validity and to gather the information necessary for 
retrieval, filtering and ranking. This process shares many of the components used to pre-process 
an input name during the building of the NS-TDS data base. 

4.1.1 Edit Name 

After the user enters a query name via the user interface, it is edited according to the following 
criteria: The name can only consist of the 26 letters of the Roman alphabet or an apostrophe. 
Further, the name must be at least 2 characters in length and no longer than 30 characters. Errors 
are displayed to the user in a dialog box, along with an appropriate message. 

4.1.2 Classify Name 

If the user has specified that culture classification is automatic (the default), the spelling of the 
name is statistically analyzed to determine the probable culture (Arabic, Chinese, Hispanic, or 
"Other"). This analysis is accomplished with the aid of the previously described name classifier 
tables (see 2.1 above). If the user has overridden the default and manually specified a culture, 
this step is skipped. 

The name is then parsed into digraphs (contiguous letter pairs) and beginning and ending 
trigraphs (contiguous letter triplets). Then, the digraphs and trigraphs are located in the 
appropriate classifier table to obtain the individual score. All of the scores are summed to obtain 
a total score. This process is repeated for all cultures. 

Next, the Name Stop List tables (see 2.1.3 above) for each culture are checked. If the name is 
found in one of the tables, the associated score is returned ("2000" means in the culture, "0" 
means not in the culture). If the name is found, the previously calculated culture score is 
replaced. 

Finally, each score is compared to a culture-specific threshold. If no scores exceed the culture 
threshold, the name is classified as "Other". If one score exceeds the appropriate threshold, the 
name is classified accordingly. If more than one score exceeds the culture threshold, the highest 
score is chosen and that culture is returned. If there is a tie (very unlikely), the culture is chosen 
alphabetically, with Arabic first, followed by Chinese and then Hispanic. 

It is important to note that an input name will receive only one classification. Further, if the 
classification is Arabic, Chinese or Hispanic, all further pre-processing will be performed twice 
(once for the default Anglo culture and once for the culture identified by classification or as 
manually specified by the user). 

4.1.3 Generate 3 Vowel Regular Expressions 

In this step, the spelling of the name is run through the multiple vowel phonetic rules to generate 
a state table that represents all of the possible pronunciations of the name in IPA form. Every 
name is run through the default Anglo phonetic rule set. If the name was classified as Arabic, 
Chinese or Hispanic, it is also run through the appropriate rule set for that culture generating a 
second state table. These will be used during the exact phonetic search (see 4.2). 

4.1.4 Generate Multiple (3) Vowel Variants 

Using the multiple vowel state table, a list of all possible IPA variants is generated. As an 
example, the name, "KNOX", which generates the regular expression (kn|kan|n)(a|u)(ks), 
generates the following IPA variants: [naks], [nuks], [kanaks], [kanuks], [knaks], and [knuks]. 
This list will be used to perform a brute force phonetic score adjustment on names that pass 
preliminary ranking (see 4.5). 

4.1.5 Generate 1 Vowel Variants 

Using the 1-vowel state table, a list of all possible one-vowel IPA variants is generated. As an 
example, the name, "KNOX", which generates the regular expression, (kn|kan|n)(a)(ks), 
generates the following variants: [naks], [kanaks], and [knaks]. This list will be used to generate 
retrieval and ranking information. 

4.1.6 Determine the Initial Consonants 

The single-vowel variants are then analyzed to generate a list of all possible initial IPA 
consonants. For example, the name KNOX starts with the regular expression, (kn|kan|n), which 
can have an initial consonant of [k] or [n]. Note that if the name starts with a vowel, the first 
IPA consonant is used to build this list. Thus, the name O'NEIL would have [n] as the initial 
consonant. This information is used during ranking (see 4.4.4). 

4.1.7 Set the Initial Vowel Switch 

Next, the single-vowel variants are analyzed to determine if it is possible for the pronunciation of 
the name to start with a vowel. It is a three-way switch that indicates that the pronunciation (1) 
can never start with a vowel, (2) can sometimes start with a vowel, or (3) always starts with a 
vowel. This information is used during ranking (see 4.4.5). 

4.1.8 Generate Leveled Variants 

Using the appropriate one-vowel rule set, a temporary list of all possible IPA variants is 
generated. This list, along with the leveled IPA matrix, is used to build a list of leveled variants. 
Also note that duplicate contiguous characters are removed. For example, the IPA characters in 
KNOX generate the following numeric codes, based on sets of similar sounds: [k] = 5; [n] = 2; 
[a] = 0; [s] = 4; [z] = 4. (Note that [s] and [z] are both indexed as "4", since they are similar 
sounds). The following unique leveled variants are generated: 52054, 502054 and 2054. Note 
that the number of leveled variants is usually less than the number of non-leveled variants. 

4.1.9 Generate Simplified Regular Expressions 

Using the simplified phonetic rules, a simplified regular expression is generated. The expression 
is encoded into compact byte representations to make further calculations faster. As before, if 
the name was classified as Arabic, Chinese or Hispanic, a second simplified regular expression is 
also generated. 

4.1.10 Initialize Search Parameters 

The search parameters specified by the user or defaulted by the application are stored along with 
the query information. These parameters set thresholds for retrieval and filtering, and determine 
the weights given to individual ranking scores. 

4.2 Exact Phonetic Match 

An exact phonetic search is always performed by TDS. It is a quick search that retrieves names 
which share at least one possible pronunciation with the query name and passes them to the 
ranker. A search of the Anglo data base is always performed; if a non-Anglo culture was 
determined or specified by the user, the search is repeated for the appropriate culture. 

4.2.1 Retrieve Candidates 

Each of the leveled variants generated by pre-processing the query name is used as a key to 
perform a binary search of the retrieval index, which is a set of unique leveled variants for the 
name data base. In the case of "KNOX", three leveled variant indices are retrieved: 2054, 
502054 and 52054. 

4.2.2 Retrieve Name Information 

Using the index-to-name map files, all data base names and associated information that could 
possibly generate the leveled variants found above are put into a list. In the case of "KNOX", 
names such as "NOCKS", "NOX", "KNOCKS" and "NAUCHS" are returned. 

4.2.3 Execute Exact Phonetic Match Algorithm 

For each name retrieved, a regular expression is generated using the appropriate multiple-vowel 
rule set. Each of these is compared to the query's regular expression to determine if there is an 
intersection. In other words, the two expressions are evaluated to see if they can generate a 
matching variant. This evaluation is done by generating non-deterministic finite state tables and 
walking through each table until a match is impossible (i.e., the names do not match), or the end 
of both tables is reached (i.e., the names match). If the name passes this algorithm, it is placed in 
a list. 

4.2.4 Pass Exact Matches to the Ranker 

All names that pass the exact-match algorithm are sent to the Ranker, along with the information 
retrieved from the name file. In addition, the phonetic score is set to 1.0, which is the highest 
possible score, and the pipe (rule set) that was used to retrieve the name is passed. Note that if a 
name was found to be an exact match under two cultures, it is included twice. 

4.3 Similar Phonetic Match 

A similar phonetic search is performed only if the user has requested it. Note that the default is 
to perform a similar search. It is slower and more thorough than the exact search, and passes 
names that sound similar (based on principles of articulation) to the query name to the ranker. A 
search of the Anglo data base is always performed; if a non-Anglo culture was determined or 
specified by the user, the search is repeated for the appropriate culture. 

4.3.1 Scan the Retrieval Index 

A complete scan of the retrieval index is performed, and each leveled variant is compared to the 
leveled variants generated by pre-processing the query. The comparison uses a standard edit 
distance calculation to determine how far apart two strings are. The algorithm determines the 
minimum number of edits (insertion, deletion or replacement of IPA characters) necessary to 
convert one string into another. A score is calculated by dividing the number of edits by the 
maximum length of the two strings and subtracting this fraction from 1, resulting in a score 
between 0.0 and 1.0. This score is compared to the retriever threshold, and those records with a 
score greater than or equal to the threshold are added to a candidate list. 
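
The retrieval score can be sketched with a standard (Levenshtein) edit distance. The function names are illustrative, and treating two empty strings as a perfect match is an assumption.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions or replacements to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # replacement
        previous = current
    return previous[-1]


def retrieval_score(a, b):
    """1 minus (edits / maximum length), giving a value between 0.0 and 1.0."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# One deletion over a maximum length of five characters.
print(retrieval_score("52054", "2054"))
```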

4.3.2 Filter the Candidates List 

This list is scanned and, if the name has not already been retrieved by the exact match algorithm, 
the simplified regular expression of the query name is compared to simplified regular 
expressions of all of the candidate names. This comparison uses a more linguistically 
sophisticated edit distance algorithm that takes into account the phonetic features of each sound. 
All edits are weighted according to the relationships stored in the Feature Distance Matrix. For 
example, the replacement of similar sounds that share most phonetic characteristics, like [s] and 
[z], is given a small penalty. The "cost" of replacements is determined by where and how 
sounds are articulated in the mouth and the "effort" required to produce one rather than the other. 
Similarly, some insertions and deletions of sounds are more costly than others (e.g., insertion of 
a [t] is more costly than insertion of a vowel, [a]). So, instead of computing the minimum 
number of edits required to convert one string into another, this algorithm calculates the path of 
least resistance. As with the retrieval calculation, a score between 0.0 and 1.0 is obtained by 
dividing the total penalty by the maximum length and subtracting this fraction from 1. This 
score is compared to the filter threshold, and those records whose score is less than the threshold 
are discarded. 
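
A feature-weighted edit distance of this kind can be sketched as below. Two simplifications should be kept in mind: the sketch compares plain IPA strings rather than simplified regular expressions, and every penalty value is invented for illustration, since the Feature Distance Matrix itself is not reproduced in this narrative.

```python
VOWELS = set("aiu")

# Hypothetical penalties: similar sounds cost little, dissimilar ones more.
REPLACEMENT_PENALTY = {
    frozenset("sz"): 0.1,
    frozenset("td"): 0.2,
    frozenset("tk"): 0.6,
}


def replacement_cost(x, y):
    if x == y:
        return 0.0
    return REPLACEMENT_PENALTY.get(frozenset((x, y)), 1.0)


def indel_cost(x):
    # Assumed values: inserting or deleting a vowel is cheaper than a consonant.
    return 0.3 if x in VOWELS else 0.8


def weighted_distance(a, b):
    """Lowest total penalty ("path of least resistance") to turn a into b."""
    previous = [0.0]
    for char_b in b:
        previous.append(previous[-1] + indel_cost(char_b))
    for char_a in a:
        current = [previous[0] + indel_cost(char_a)]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + indel_cost(char_a),
                               current[j - 1] + indel_cost(char_b),
                               previous[j - 1] + replacement_cost(char_a, char_b)))
        previous = current
    return previous[-1]


def filter_score(a, b):
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_distance(a, b) / longest


print(filter_score("naks", "nakz"))
```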

Finally, it is important to note that because simplified regular expressions can generate more 
variants than the expressions they were derived from, it is possible for this score to be higher 
than expected, although it is impossible to obtain a lower score. This deficiency is corrected 
during ranking. 

4.3.3 Pass Similar Matches to the Ranker 

For all records that pass the filter algorithm, additional information is gathered from the name 
file via the index-to-name map. Also, the phonetic score of these names is set to the score 
calculated during the filter edit distance calculation, and the culture pipe (i.e., rule set) that was 
used to retrieve the name is passed. This list is sent to the Ranker for initial scoring. Note that if 
a name was found to be a similar match under two cultures, it is included twice. 

Copyright © 1998, all rights reserved 
Language Analysis Systems, Inc. 
02/26/98 

4.4 Initial Ranking 

Names that pass the retrieval and filter stages via the exact or similar phonetic match searches 
are sent to the Ranker, along with the phonetic score calculated by the filter and all of the data 
built during the pre-processing stages (initial consonant, initial vowel, etc.). The Ranker also 
knows which search (exact or similar) produced the return. It takes this information, calculates 
several other scores, applies weights to those scores based on the query parameters and produces 
a ranked list of names with combined scores. Initial ranking differs from final ranking in that it 
uses the phonetic score calculated by the filter. Final ranking, which is described below (see 
4.5), performs a more exhaustive and exact phonetic score calculation. 

4.4.1 Calculate the Spell 1 Score 

Because spelling is a relevant factor in determining similarity of names, the Ranker is set up to 
consider spelling in its calculations in ranking of names passed to it. In the case of exact 
matches, for example, all phonetic scores are 1.0, but spellings can vary widely (e.g., "LI", 
"LEE", "LEIGH"). The Spell 1 score is a comparison of all the letters in the query name to all 
of the letters in the data base name. Each letter that matches contributes to the score, and no 
letter can be used more than once. Note that the position of the letter has no bearing on the score. 
So, the "K" in "KNOX" matches the "K" in "SACK". In addition, there is an option to bias the 
score so that letters on the left side of the query name count more than those towards the end. 
The "left-bias" factor defaults to true. A score is calculated by dividing the value of the matches 
(an integer, if left bias is not used) by the maximum length of the query and data base name and 
then subtracting this fraction from 1.0. This results in a score between 0.0 and 1.0. 

4.4.2 Calculate the Spell 2 Score 

The Spell 2 score works similarly to Spell 1, except that it uses digraphs instead of single letters. 
Digraphs are contiguous letter pairs formed by parsing a name bracketed by a beginning and an 
ending boundary. For example, the name "FRED" consists of five Roman character digraphs: 
"#F", "FR", "RE", "ED" and "D#", where "#" represents a name boundary. Digraphs build some 
contextual information into the calculation, with the result that "FRED" and "BRID", which 
share two non-contiguous letters, have a lower Spell 2 score than a Spell 1 score. Spell 2 uses 
the same left bias parameter and the same method to turn the calculation into a decimal number 
between 0.0 and 1.0 as Spell 1. Finally, the Spell 2 score contains a special adjustment for 
names that start with a vowel. For example, when comparing the name, "NEIL" to "ONEIL" 
and "SNEIL", the score for "NEIL" will be adjusted upwards by a small factor. 

4.4.3 Calculate the Syllable Score 

The Syllable score compares the number of syllables in the query name to the number of 
syllables in the data base name. Counting the number of syllables in a name is based on the 
spelling, and essentially says that a syllable occurs when there are one or more vowels in a row, 
preceded by a consonant or a word boundary. Adjustments are made for special cases such as 
diphthongs (multiple consecutive vowels pronounced as two syllables) and "E" or "ES" at the end 
of a name, which often does not produce a separate syllable. A score is produced by dividing the 
difference in the number of syllables between the query name and the data base name by the 
maximum number of syllables in the query or data base name and subtracting the resulting 
fraction from 1.0. This results in a score between 0.0 and 1.0. 
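
The Syllable score can be sketched as follows. The syllable counter implements only the basic rule (one syllable per run of vowels) and omits the special-case adjustments; the vowel set and the handling of names with no vowels are assumptions.

```python
import re


def count_syllables(name):
    """Count runs of one or more vowels; special cases are not handled."""
    return len(re.findall(r"[AEIOU]+", name))   # assumption: Y is not a vowel


def syllable_score(query_syllables, db_syllables):
    """1.0 minus (difference / maximum number of syllables)."""
    most = max(query_syllables, db_syllables)
    if most == 0:
        return 1.0                              # assumption: nothing to compare
    return 1.0 - abs(query_syllables - db_syllables) / most


query, candidate = "KNOX", "BARKER"
print(syllable_score(count_syllables(query), count_syllables(candidate)))
```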

4.4.4 Calculate the Initial Consonant Score 

The initial consonant sounds in names hold particularly prominent positions in determining 
similarity of names. The Initial Consonant score compares the first occurrence of an IPA 
consonant in the query name to that of the data base name. The consonants are compared based 
on the Feature Distance Matrix, producing a score between 0.0 and 1.0. For example, [s] and [z] 
will return a high score, whereas [k] and [r] will return a low score. Note that the first 
consonant can be different, as is the case with the name "KNOPF", which could start with a [k] 
sound or an [n] sound. The algorithm compares all possibilities and returns the best possible 
score. 

4.4.5 Calculate the Initial Vowel Score 

The Initial Vowel score comes into play when both the query name and the data base name start 
with a vowel. If this is not the case, the initial vowel score is 1.0. Otherwise, the IPA vowel or 
vowels that start the names are compared and a score is returned based on the feature distances 
between them. The score is a decimal between 0.0 and 1.0. 

4.4.6 Calculate the Culture Score 

The culture score compares the culture of the query name as determined by the classifier or as 
specified by the user with the "pipe" (rule set) used to retrieve the data base name. So, if a name 
is classified as Arabic, and the name being ranked was passed to the ranker via the Arabic pipe, 
the culture score is 1.0. If the query culture does not match the pipe used to retrieve the name, 
the culture score is 0.0. This allows the Ranker to "bump up" names that share the same cultural 
identity, as in Chinese "CHIN" and "CHANG" (versus non-Chinese "CHAIN", for example). 

4.4.7 Calculate the Final Score 

The final or total score is an amalgamation of all the previous scores. NS-TDS maintains a 
set of parameters that allows the user to assign weights to each of the various individual scores. 
Note that in the user version, these weights are not modifiable. The weights are intended to be 
percentages, so that each factor is a decimal between 0.0 and 1.0 and the total adds up to 1.0. 
This is not a requirement, as the calculation recomputes the weights relative to one another. 

So, to arrive at the final score, all of the individual scores are multiplied by their weight and the 
results are summed. This results in a decimal score between 0.0 and 1.0 with a higher score 
indicative of a better match. 
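
The weighted sum, including the rescaling of weights relative to one another, might look like the sketch below. The weight values are invented for illustration; only the component names come from the text.

```python
def final_score(scores, weights):
    """Weighted sum of component scores, with weights rescaled to total 1.0."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] / total_weight for name in scores)

# Hypothetical weights; they need not add up to 1.0 because they are rescaled.
weights = {"phonetic": 5, "spelling": 2, "initial_consonant": 1,
           "initial_vowel": 1, "culture": 1}
scores = {"phonetic": 1.0, "spelling": 0.5, "initial_consonant": 1.0,
          "initial_vowel": 1.0, "culture": 0.0}
print(round(final_score(scores, weights), 3))  # 0.8
```

Because each weight is divided by the total, supplying 5, 2, 1, 1, 1 or 0.5, 0.2, 0.1, 0.1, 0.1 gives the same result.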

4.4.8 Set the Ranking Order 

The absolute ranking of data base names is based on whether or not the name is an exact 
phonetic match and on the value of the final score. Exact phonetic matches are always ranked 
first, followed by similar matches. Within these two categories, names are ranked according to 
the final score, with higher scores ranked at the top of the list. It is quite possible that an exact 
match will receive a lower final score than a similar match (if its spelling and/or culture scores 
are low, for example), and yet be ranked above the similar match based on its category of "exact 
match". For example, "KNOX" returns the exact match "NAX" with a lower score (.825) than 
the similar match "KNAGGS" (.848), but forces all exact matches, including "NAX", to the top 
of the list. 
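
The two-level ordering can be expressed as a sort on a compound key. The sketch below uses a hypothetical Candidate record and the scores quoted in the example above.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    exact: bool   # True if retrieved as an exact phonetic match
    score: float  # final score, 0.0 to 1.0

def rank(candidates):
    """Exact matches first, then similar matches; higher scores first in each group."""
    return sorted(candidates, key=lambda c: (not c.exact, -c.score))

results = rank([
    Candidate("KNAGGS", exact=False, score=0.848),
    Candidate("NAX", exact=True, score=0.825),
])
print([c.name for c in results])  # ['NAX', 'KNAGGS']
```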

4.4.9 Return a Ranked Set 

Finally, the Ranker eliminates names that do not meet or exceed the threshold set in the NS-TDS 
parameters, unless the name is considered an exact match. Exact phonetic matches are always 
returned, regardless of their score. 

4.5 Final Ranking 

The purpose of Final Ranking is to incorporate a more accurate phonetic score into the overall 
ranking. Recall that the phonetic score used by the initial Ranker is calculated by the Filter, 
using simple regular expressions. This algorithm, while fast, can inflate the phonetic score, 
producing inaccurate ranking. Further, the Filter calculates using single vowel rules, which can 
introduce another source of inaccuracy (e.g., "LITZ" = "LUTZ"). Final ranking adjusts the 
phonetic score by performing a brute force edit distance calculation, using multiple vowel rules. 
This calculation is performed at this point because it is time-consuming, and must be limited to 
the smallest possible set of input data to meet performance requirements. Note that Final 
Ranking is a parameter option, although the default is set to true. 

4.5.1 Recalculate the Phonetic Score 

All names that were retrieved via the similar phonetic search and passed initial ranking are 
reprocessed to produce an accurate phonetic score. Names that were retrieved via the exact 
phonetic match search do not need to be recalculated because their phonetic score is always 1.0. 
First, the names are passed through the appropriate cultural multiple vowel rule set to produce a 
list of all possible IPA variants. Then, a brute force edit calculation is performed; every variant 
from the query name is compared to every variant of the data base name by performing a 
phonetic edit distance calculation. The best score is retained and assigned to the result. 
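
A rough sketch of the brute force comparison follows. It assumes a plain Levenshtein distance with unit costs, whereas the phonetic edit distance described here presumably weights substitutions by feature distance. The variant strings are ASCII placeholders rather than real IPA output.

```python
from itertools import product

def edit_distance(a, b):
    """Plain Levenshtein distance between two sequences of symbols."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Convert an edit distance into a score between 0.0 and 1.0."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def best_phonetic_score(query_variants, db_variants):
    """Compare every query variant to every data base variant; keep the best score."""
    return max(similarity(q, d) for q, d in product(query_variants, db_variants))

print(best_phonetic_score(["lits", "laits"], ["luts"]))  # 0.75
```

The cost grows with the product of the two variant counts, which is why the calculation is deferred until the candidate set is as small as possible.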

4.5.2 Recalculate the Final Score 

Using the same logic as that used by the initial Ranker, the Final Score is calculated using the 
new, more precise phonetic score. This will result in lower scores for some names; if these 
names fall below the final score threshold, they are removed from the ranked list. 

4.5.3 Rebuild the Ranked Set 

Finally, using the new score, the set of names is ranked again, with exact matches at the top 
followed by similar matches. Also, the final ranker will only return the maximum number of 
names requested by the user. The default setting is 145. Thus, it is possible for a name to pass 
NS-TDS, but not be displayed on output. 

1.0 Introduction 



This Technical Plan describes LAS's proposed design for the Technology Demonstration System 
(TDS) and includes a conceptual design, the target hardware platform, operating system, support 
software and development environment, the name data base and a work plan that provides a 
schedule for development and implementation. 

2.0 Background 

The TDS project is the result of the findings and recommendations of the Name Search Research 
Project, conducted from September, 1995 through June, 1997. The goal of the Project was to 
determine the utility and feasibility of using phonological information about pronunciation of 
person names in order to improve the quality of non-exact automatic name searching. Phase 1 of 
the Project concluded that there was substantial evidence to support the use of phonological 
information in automatic name searching. Specifically, the conclusions recommended: 

• using the International Phonetic Alphabet (IPA) to represent multiple pronunciations 
of names unambiguously, and 

• measuring articulatory similarity of names through phonetic features and processes. 

Phase 2 of the Project built upon the results of Phase 1, specifically by: 

• expanding, refining and testing sets of IPA rules from Phase 1 to represent multiple 
pronunciations of Anglo, Arabic, Hispanic and Mandarin Chinese names; test results 
returned at a retrieval rate of 92%; 

• exploring a set of factors that contribute to articulatory similarity, including factors at 
the syllable level. 

Phase 2 recommended the development of a Name Search Technology Demonstration System 
(TDS) to extend and transfer the phonology-based technology from the Name Search Research 
Project to a functional, automatic, integrated TDS. 



3.0 Environment 

The hardware and software environment are well defined. It is a simple environment that is 
geared to flexibility and performance. In other words, LAS does not intend to introduce 
complications by using resource intensive and/or expensive support software or hardware. TDS 
will be built using the following hardware and software components: 

• Standard, high performance Intel-based laptop computers. By purchase time, we expect to 
configure the machines as follows: 

• Intel Pentium, Pentium Pro or Pentium II CPU; 

• 160 to 512 Mb of memory (160 is the current maximum); 

• 3 Gb of disk storage; 

• High speed CD-ROM (8x minimum); 

• Standard Ethernet network card for high speed data transfer; 

• High resolution monitor with a bright clear screen. 

• Windows 32-bit operating environment and development software: 

• Windows 95 or Windows NT 4.x (depending on the processor available); 

• Microsoft Visual C++ version 5.x (for development only); 

• Microsoft Access (support table maintenance); 

• Custom data storage techniques maximizing memory usage (no RDBMS); 

• Multi-threaded architecture to begin displaying responses within 12 seconds. 

TDS will run on a top-of-the-line standard laptop computer, using a suite of custom developed 
executables and dynamic link libraries (DLLs) and standard end user software (i.e., MS Access, 
Excel, etc.). There will be no special or extraordinary system or software requirements. 



4.0 Conceptual Design 

The following diagram gives an overview of the system LAS intends to build. It is based on 
previous documents developed by the sponsor and LAS and represents research done to date. It 
identifies the major components and processes to be included in TDS; however, it is a 
preliminary design, and is subject to change depending on the outcome of further research and 
time constraints. After the diagram, each of the components and interactions between them is 
described in further detail. 

In the chart, components that are identified in italics are support functions not intended for 
standard use. They are, however, necessary for LAS to build, tune and test the system. 

Input Query Name - A simple Graphical User Interface (GUI) will be built to allow the user to 
enter a name that will be compared to the name data base. 

Parameter Manager - LAS will provide options so the user can tune or limit the search. These 
options will be controlled through the GUI and stored in a Parameter Data set. Parameters 
currently being considered include: 

• exact only (fast) or exact and similar matches; 

• level of similarity (loose or tight); 

• culture specific matches (Anglo, Arabic, Chinese and/or Hispanic); 

• number of returns (maximum/default = 145); 

• bypass name classification. 

Note that it will not be necessary to set parameters to perform a name search. Default parameters 
will be used in the event that the user goes directly to the name check screen. 

Name Classifier - The name classifier determines whether the ethnicity of a name is Arabic, 
Chinese or Hispanic. The LAS name classifier uses a data base of contiguous letter pairs 
(digraphs) and triplets (trigraphs) that has been statistically analyzed to rank digraphs and 
trigraphs according to ethnic origin. With this information, it calculates a score for each culture 
that shows the probability of the name being Arabic, Hispanic and/or Chinese. The highest 
positive score will determine which non-Anglo algorithm to use in addition to the standard 
Anglo algorithm. Note that it is possible for all scores to be negative, in which case only the 
Anglo algorithm will be used. This component will be based on an existing system developed in 
Clipper by LAS that will be converted to C++ to better interface with the other components. 
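
To make the scoring idea concrete, a minimal sketch is given below. The n-gram weights are invented for illustration and the additive scoring is an assumption; the statistically derived values in the LAS classifier data base are not reproduced in this document.

```python
# Hypothetical n-gram weights: positive values favor a culture, negative values count against it.
NGRAM_WEIGHTS = {
    "Arabic":   {"AB": 1.0, "BD": 2.0, "ABD": 3.0, "UL": 1.0, "CH": -1.0},
    "Chinese":  {"CH": 1.0, "NG": 2.0, "ANG": 3.0, "BD": -2.0},
    "Hispanic": {"EZ": 2.0, "GU": 1.0, "NG": -1.0, "BD": -2.0},
}

def ngrams(name, n):
    """Return all contiguous letter groups of length n."""
    name = name.upper()
    return [name[i:i + n] for i in range(len(name) - n + 1)]

def classify(name):
    """Return (culture or None, scores). None means only the Anglo algorithm is used."""
    grams = ngrams(name, 2) + ngrams(name, 3)
    scores = {culture: sum(table.get(g, 0.0) for g in grams)
              for culture, table in NGRAM_WEIGHTS.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores

print(classify("ABDUL")[0])  # Arabic
print(classify("SMITH")[0])  # None
```

A name with no positive score for any culture falls through to the Anglo algorithm alone, matching the rule stated above.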

Classifier Manager - This is a simple interface necessary to apply values to the digraphs and 
trigraphs according to ethnicity. Most likely, LAS will use a standard data base package to 
manipulate classifier data (i.e., MS Access). Note that the existing classifier data base is already 
returning adequate results. Improvements will be made if time and resources permit. 

Name Preprocessor - At a minimum, this component will convert the input name into one or 
more IPA representations. Almost certainly, it will generate numerous variants based on 
different phonetic representations that will be passed to the Retriever. Furthermore, additional 
information about the query name will be necessary in order to use the similar search keys (i.e., 
name length, syllabic structure, etc.). 

Rule Sets - Four rule sets will be used to convert Roman character representations of names into 
IPA representations. The default rule set, Anglo, will always be used; the other three, Arabic, 
Chinese and Hispanic will be used if the name is classified as belonging to one of these ethnic 
groups and the user has specified that other ethnic variations are to be used. These rule sets will 
be based on the work done in previous projects. The Anglo rule set will need considerable 
modification to support Anglo pronunciations of non-Anglo names. They will be maintained by 
a Rule Manager, that allows LAS to build and modify rule sets as necessary. 

Quick Queries - To ensure that the 12 second initial response time requirement is met, LAS 
intends to segment and multi-thread searches of the name data base. Quick queries retrieve those 
records that contain the same IPA characters, or the same IPA consonants with a vowel placeholder, 
or the same IPA consonants. The ultimate retrieval scheme will be determined by further 
research. This approach will allow TDS to pass a small subset of data to the Ranker and begin 
returning a list of the "best" names quickly. Note that this scheme does not consider differences 
in name length (i.e., insertion and deletion). The output of the quick query component will be a 
list of IPA representations that the Retriever will use to extract records for evaluation by the 
Ranker. 
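
One possible reading of the three quick query levels is sketched below. The vowel set, the "V" placeholder and the ASCII strings are assumptions made for illustration, since the retrieval scheme was still to be determined by research.

```python
VOWELS = set("aeiou")  # hypothetical stand-in for the IPA vowel inventory

def quick_keys(ipa):
    """Build three increasingly loose lookup keys from one IPA representation."""
    exact = ipa                                                       # same IPA characters
    placeholder = "".join("V" if ch in VOWELS else ch for ch in ipa)  # vowels masked
    consonants = "".join(ch for ch in ipa if ch not in VOWELS)        # consonants only
    return exact, placeholder, consonants

print(quick_keys("lits"))  # ('lits', 'lVts', 'lts')
```

Under this scheme "lits" and "luts" differ on the exact key but share the placeholder key, while no key tolerates an inserted or deleted consonant.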

Deep Queries - By far the most difficult problem to solve, deep queries will allow TDS to subset 
the name data base into phonetically similar sections and account for varying levels of name and 
possibly syllable length. They must consider the insertion and/or deletion of IPA characters and 
the proximity of different IPA characters based on the number and importance of features they 
have in common (e.g., "p" and "b" differ by only one phonetic feature). Almost certainly, deep 
queries will include all names retrieved by quick queries. If performance is acceptable for deep 
queries, the quick query logic may become unnecessary. The output of the deep query 
component will be a list of IPA representations that the Retriever will use to extract records for 
evaluation by the Ranker. 

Retriever - This component accepts query lists from the query preprocessors and passes subsets 
of the name data base to the Ranker. Operating simultaneously with the query components, it 
processes query lists in the order that they are received. Each input list will be identified as a 
"quick" or a "deep" list so that the component can choose the proper key set to use to generate 
the output list. Once the subset is determined, the retriever will build a list or a range of records 
to be passed to the Ranker. This list will contain the IPA representation used to retrieve the 
record, the actual Roman character representation of the name and the rule set used to return the 
name. 

Quick Keys - Each name in the data base will generate one or more IPA representations of the 
Roman character version. Each rule set can generate different IPA representations. All 
representations will be stored in the Quick Key data set that will point to the name that generated 
the particular version. Furthermore, quick keys will be tagged as belonging to the rule set that 
generated the representation. 

Deep Keys - This data set will consist of keys that contain IPA representations, IPA similarity, 
name length and possibly, syllabic information. It will be designed to allow for subsetting of the 
name data base into names that are potentially similar to the query name. It must overcome the 
two major problems in determining name similarity: sounds can be mispronounced (Pine = 
Bine) and names can be substrings of each other (McDonald = Donald). A key area of research 
that must be resolved early in the project is the use of indices to represent similar IPA characters 
(one character to represent "b" and "p"). 
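
The index idea can be sketched as follows. The similarity classes, the index characters and the ASCII transcriptions are hypothetical; whether such indices would work was an open research question at the time of this plan.

```python
# Hypothetical similarity classes: each group of similar sounds maps to one index character.
SIMILARITY_CLASS = {"p": "1", "b": "1", "t": "2", "d": "2", "k": "3", "g": "3"}

def deep_key(ipa):
    """Replace each sound by its class index so that similar sounds compare equal."""
    return "".join(SIMILARITY_CLASS.get(ch, ch) for ch in ipa)

def shares_substring(key_a, key_b):
    """True if one key is contained in the other (e.g., McDonald and Donald)."""
    return key_a in key_b or key_b in key_a

print(deep_key("pain") == deep_key("bain"))                           # True
print(shares_substring(deep_key("makdonald"), deep_key("donald")))  # True
```
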

Name Data - This component represents the raw data provided by LAS internally for 
development and ultimately by the sponsor for the production version of TDS. Each name will 
be stored in its Roman character representation and will be identified by a record ID. These ID's 
will be used to tie the raw name to the quick and deep keys. 

Key Loaders - Batch programs will be developed that take an ASCII text file of names as input 
to generate the Name, Quick Key and Deep Key data bases. These programs will use the Name 
Classifier and Name Preprocessor to generate keys and build or rebuild these data bases. They will 
edit the names and produce a summary statistical report and a detailed error report showing any 
abnormalities encountered (i.e., invalid length, invalid characters, etc.). 
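
The edit step of such a loader might be sketched as shown below. The permitted character set and the length limit are assumptions, as the input specifications are not given here.

```python
import re

VALID_NAME = re.compile(r"^[A-Z'\-]+$")  # assumed character set
MAX_LENGTH = 30                          # assumed length limit

def load_names(lines):
    """Validate raw names; return (accepted, errors, summary)."""
    accepted, errors = [], []
    for line_no, raw in enumerate(lines, start=1):
        name = raw.strip().upper()
        if not name or len(name) > MAX_LENGTH:
            errors.append((line_no, raw, "invalid length"))
        elif not VALID_NAME.match(name):
            errors.append((line_no, raw, "invalid characters"))
        else:
            accepted.append(name)
    summary = {"read": len(accepted) + len(errors),
               "accepted": len(accepted), "rejected": len(errors)}
    return accepted, errors, summary
```

The accepted names would then be passed to the Name Classifier and Name Preprocessor to build the keys, while the errors list feeds the detailed error report.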

Ranker - This component processes a list of candidate records generated by the Retriever. The 
list will consist of records containing the IPA key that returned the record, the rule set used to 
generate the IPA key and the actual name. The Ranker will sort the names in order of match 
quality based on parameters set in the Ranker Manager data set. Output will be passed 
dynamically to the Result Manager for real-time display to the user. Ranking methods will be 
based on schemes developed in phase 2 of the phonology project (regular expression intersection 
and the "voter scheme"). It will also consider the rule set used to regularize the input name into 
IPA representations. 

Ranker Manager - This is an optional component that will allow LAS and/or the sponsor to 
rank returns according to different sort schemes. As mentioned above, the ranking schemes will 
be based on previous work: regular expression intersection or a voter scheme, and possibly 
non-phonetic schemes such as Soundex, digraph analysis, edit distance methods, etc. 

Result Manager - This component will accept input from the Ranker, and maintain a deduped, 
sorted list based on the parameters set by the Ranker Manager. This list will be passed to the 
GUI for display to the user, and it will be managed dynamically so that the list is constantly 
being updated as results are processed by the Ranker. 
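
A small sketch of a deduplicating, continuously sorted result list is given below. The class and method names are hypothetical, and the sketch sorts on score only, leaving aside any exact-match grouping or other sort schemes.

```python
import bisect

class ResultManager:
    """Keeps a de-duplicated list of (name, score) sorted by descending score."""

    def __init__(self):
        self._best = {}    # name -> best score seen so far
        self._sorted = []  # (-score, name) tuples kept in ascending order

    def add(self, name, score):
        """Insert a result, keeping only the best score for a repeated name."""
        old = self._best.get(name)
        if old is not None:
            if score <= old:
                return
            self._sorted.remove((-old, name))
        self._best[name] = score
        bisect.insort(self._sorted, (-score, name))

    def snapshot(self):
        """Return the current list for display, best score first."""
        return [(name, -neg) for neg, name in self._sorted]
```

Because the list stays sorted after every insertion, the GUI can take a snapshot at any moment while the Ranker is still producing results.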

Display Results - This component is the output side of the GUI. It displays the list produced by 
the Ranker for viewing and other manipulation (printing, saving to a file, etc.) by the user. The 
outputs are maintained by query name and are updated dynamically as results are returned from 
the Ranker. In addition to the ranked list of names, the GUI will also display information on 
why a particular name was chosen, and each name will be given a score that relates it to the 
names above and below it on the list. 

Logger - This component will be a development and debugger tool for LAS to determine how 
well TDS is working, and to aid in testing and problem resolution. 

5.0 Name Data Base 

The ultimate target for TDS is a sponsor data base consisting of 3 million unique name segments 
(i.e., "John" and "Fitzgerald", not "John Fitzgerald"). LAS must generate a similar data base of 
name segments since the sponsor data base is classified. To do this, LAS will take advantage of 
numerous resources that will be used without compromising the privacy and sensitivity of the 
data. That is, only name segments will be extracted from these sources. It will be impossible to 
tie the TDS names to the source data base. Sources to be used include: 

• Visa Lookout Data from the Department of State; 

• Passport Lookout Data from the Department of State; 

• Census Data from the Department of Commerce; 

• Phone Book data from commercial sources; 

• Known variant lists. 

Should the above sources fail to generate 3 million unique name segments, LAS will resort to 
generating variations by programmatically manipulating letter variations (i.e., "ck" for "ch", "e" 
for "i", etc.). Currently, LAS has processed 20 million names, which have generated 1 million 
unique name segments. 
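
Segment extraction and the fallback variant generation could be sketched as follows. The substitution list contains only the two examples quoted above, and the function names are hypothetical.

```python
def unique_segments(full_names):
    """Split full names into individual segments and drop duplicates."""
    segments = set()
    for full in full_names:
        segments.update(part.upper() for part in full.split())
    return segments

# Substitution pairs (old, new) taken from the examples in the text.
SUBSTITUTIONS = [("CH", "CK"), ("I", "E")]

def spelling_variants(segment):
    """Generate artificial variants of a segment by letter substitution."""
    variants = set()
    for old, new in SUBSTITUTIONS:
        if old in segment:
            variants.add(segment.replace(old, new))
    variants.discard(segment)
    return variants

print(sorted(unique_segments(["John Fitzgerald", "john smith"])))
print(sorted(spelling_variants("RICH")))  # ['RECH', 'RICK']
```

Storing only the segments, with no link back to the full name or its source record, is what keeps the derived data base from being tied to the source data.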

Crucial to the successful completion of TDS is an opportunity by LAS to evaluate sponsor data 
as soon as possible. While not in the Statement of Work or the Project Plan, LAS feels it is 
advantageous to the sponsor to allow LAS to gain access to the sponsor name data base as soon 
as possible. While no problems are expected, it is prudent to verify this assumption as the 
success of TDS is ultimately dependent on the ability to successfully integrate sponsor data. 



6.0 Work Plan 

Please note that this section of the Technical Plan has been copied in entirety from the Project 
Plan previously submitted. Attachment A to this plan is a Gantt chart with a Work Breakdown 
Structure that describes the schedule of development LAS intends to follow. The rest of this 
section describes the major events in the Gantt chart. 

The schedule for the development of TDS spans eight months and consists of four major phases: 

• Planning - One month to generate project and technical plans. 

• Phase 1 Development - Three months to resolve research issues, determine a strategy 
to find "similar-to" names, define and validate linguistic search techniques, and 
produce a limited version of TDS for an early look test. 

• Phase 2 Development - Three months to expand phase 1 into a fully functional 
system to include expanded rule sets that enable TDS to accommodate Anglo 
pronunciations of foreign names and native pronunciations for Arabic, Hispanic and 
Chinese names. 

• Implementation - Four months (with three months overlapping the development 
effort) to procure two laptop computers, install the system for the sponsor, provide 
training, and document the results of the project. 

• Maintenance - Four months to modify, upgrade and correct TDS at the direction of 
the sponsor. 

Each phase concludes with a specific set of deliverables (both internal to LAS and formal 
deliveries to the sponsor). There is some flexibility in the schedule; however, the dates set for 
sponsor deliveries are firm. 

6.1 Phase 1 Development 

The purpose of phase 1 is to prove that LAS can develop a viable name search system based on 
phonetics. The goals are to produce a complete design for TDS and develop a prototype system. 
Although limited in functionality, the prototype must be complete enough to pass an early look 
test based on a test plan generated by LAS. The sponsor has the ultimate authority to decide 
whether or not the prototype justifies further development. Phase 1 consists of the following 
tasks: 

• Research - Previous work by LAS has generated many working theories and 
prototypes/work benches. All of this work must be analyzed further to determine which 
theories are best applied to TDS. In early September, when this task is scheduled to 
conclude, LAS will know how all of the major components of TDS will work and will 
have a conceptual design document that drives further development. In addition, LAS 
will deliver input specifications for data to be loaded into TDS. 

• Development - Based on the outcome of the research task, LAS will develop the first 
limited version of TDS. Ideally prototypes developed during the research task will form 
the basis for this version of the system. In addition, the Linguistic team will continue to 
refine their research from the previous period and provide guidance to the Technical 
team. 

• Test Plan - During research and development, a test plan will be developed. First, 
requirements will be culled from existing documentation and results from the research 
task. Then, these requirements will be used to develop test scripts. There are four major 
areas to be tested: functionality, performance, retrieval accuracy and ranking accuracy. 
The test plan is a deliverable required by the contract. 

• Build and Test - A week has been reserved in the schedule to integrate the output of the 
development task, after which there are three weeks to execute the test, make any 
corrections and document the results. 

While subject to change, the plan calls for the first version of TDS to contain the following 
features: 

• A fully functional name classifier that can identify Arabic, Chinese and Hispanic names. 
The name classifier will be ported to C++ from LAS's already developed PC-NAS 
system that is currently written in Clipper. 

• Name processors for Anglo and Arabic names. Note that in this phase, the Anglo name 
processor will not include the extended rule set for atypical Anglo names. 

• A fully functional name data base with a key structure that accommodates both IPA exact 
match and phonetically "similar to" searching. A program to load raw data into the name 
data base will also be produced during this phase. The name data used will be obtained 
from LAS resources. 

• A search engine that, when given a query name and its ethnicity, will search the name 
data base and provide a list of matches. 

• A limited version of the ranker with a sorting algorithm to be determined during 
development. 

• A limited graphical user interface (GUI) to allow for evaluation of the TDS. 

6.2 Phase 2 Development 

Phase 2 provides three months to complete the development of TDS. Currently, the features to 
be developed in this phase are: 

• Develop the Hispanic and Chinese name processors. 

• Extend the Anglo name processor to include rules for atypical Anglo names and 
pronunciations. 

• Complete the Ranker to include a sort algorithm with additional sorts as deemed useful. 

• Finalize the GUI to include all features required by the sponsor and/or deemed desirable 
by LAS. 

• Finalize the TDS documentation to include a simple user manual and descriptions of all 
algorithms. 

Phase 2 culminates in the execution of an acceptance test with time built in for bug fixes and test 
documentation. 

6.3 Implementation 

The final task is to deliver the system to the sponsor. LAS will purchase, test and configure two 
high-end laptop computers, load sponsor data into TDS, provide training, and write a final report. 